We consider the problem of estimating the structural function in nonparametric instrumental
regression, where in the presence of an instrument W a response Y is modeled in dependence of
an endogenous explanatory variable Z. The proposed estimator is based on dimension reduction
and additional thresholding. The minimax optimal rate of convergence of the estimator is
derived assuming that the structural function belongs to some ellipsoids which are in a
certain sense linked to the conditional expectation operator of Z given W. We illustrate these
results by considering classical smoothness assumptions. However, the proposed estimator
requires an optimal choice of a dimension parameter depending on certain characteristics of
the unknown structural function and the conditional expectation operator of Z given W, which
are not known in practice. The main issue addressed in our work is a fully adaptive choice of
this dimension parameter using a model selection approach under the restriction that the
conditional expectation operator of Z given W is smoothing in a certain sense. In this
situation we develop a penalized minimum contrast estimator with randomized penalty and
collection of models. We show that this data-driven estimator can attain the lower risk bound
up to a constant over a wide range of smoothness classes for the structural function.

1. Introduction



Nonparametric instrumental regression models have attracted increasing attention in the
econometrics and statistics literature (cf. Florens (2003), Darolles et al. (2002), Newey and
Powell (2003), Hall and Horowitz (2007) or Blundell et al. (2007), to name only a few). In
instrumental regression, the dependence of a response Y on the variation of an endogenous
vector Z of explanatory variables is characterized by

    $$Y = \varphi(Z) + U \tag{1.1a}$$

for some error term U. Additionally, a vector of exogenous instruments W such that

    $$\mathbb{E}[U \mid W] = 0 \tag{1.1b}$$

is supposed to be observed. The nonparametric relationship is hence modeled by the structural
function $\varphi$. Typical examples are error-in-variable models, simultaneous equations
or treatment models with endogenous selection. However, it is worth noting that in the
presence of instrumental variables the model equations (1.1a-1.1b) are the natural
generalization of a standard parametric model (see, e.g., Amemiya (1974)) to the nonparametric
situation. This extension was first introduced by Florens (2003) and Newey and Powell (2003),
while its identification has been studied e.g. in Carrasco et al. (2006), Darolles
et al. (2002) and Florens et al. (2010). It is interesting to note that recent applications
and extensions of this approach include nonparametric tests of exogeneity (Blundell and
Horowitz (2007)), quantile regression models (Horowitz and Lee (2007)), or semiparametric
modeling (Florens et al. (2009)), to name but a few.

The nonparametric estimation of the structural function $\varphi$ based on a sample of
(Y, Z, W) has been studied in the literature. For example, Ai and Chen (2003), Blundell et
al. (2007) or Newey and Powell (2003) consider sieve minimum distance estimators, while
Darolles et al. (2002), Gagliardini and Scaillet (2006) or Florens et al. (2010) study
penalized least squares estimators. The optimal estimation in a minimax sense has been studied
by Hall and Horowitz (2005) and Chen and Reiß (2008). The authors prove a lower bound for the
mean integrated squared error (MISE) and propose an estimator which can attain optimal
rates. In the present work, we extend this result by considering not only the MISE of the
estimation of $\varphi$ but, more generally, a weighted risk (defined below), which allows us,
for example, to consider the estimation of the derivatives of $\varphi$, too. We show a lower
bound for this weighted risk and propose an estimator which can attain this lower bound up to
a constant.

It has been noticed by Newey and Powell (2003) and Florens (2003) that the nonparametric
estimation of the structural function $\varphi$ generally leads to an ill-posed inverse
problem. More precisely, consider the model equations (1.1a-1.1b). Taking the conditional
expectation with respect to the instruments W on both sides in equation (1.1a) leads to the
conditional moment equation

    $$\mathbb{E}[Y \mid W] = \mathbb{E}[\varphi(Z) \mid W]. \tag{1.2}$$

Therefore, the estimation of the structural function $\varphi$ is linked to the inversion of
equation (1.2), which is not stable in general and hence an ill-posed inverse problem (for a
comprehensive review of inverse problems in econometrics we refer to Carrasco et al. (2006)).
To cope with this instability, one generally employs regularization techniques which involve
the choice of a smoothing parameter. It is well known that the resulting estimation procedure
can attain optimal rates only if this parameter is chosen in an appropriate way. This choice
requires in general knowledge of characteristics of the structural function, such as the
number of its derivatives, which are not known in practice. Thus, one of the essential
problems in this theoretical framework is the fully data-driven choice of smoothing
parameters. In the present work, we present an adaptive method which indeed does not depend on
any properties of $\varphi$, but which still requires that some characteristics of the
underlying operator are known.

One objective of this paper is the minimax optimal nonparametric estimation of the structural
function $\varphi$ based on an independent and identically distributed (i.i.d.) sample of
(Y, Z, W) obeying (1.1a-1.1b). After showing the lower risk bounds, we will follow an
estimation approach often used in the literature. For the moment, suppose that the structural
function can be developed by using only k pre-specified functions $e_1, \dots, e_k$, say
$\varphi = \sum_{j=1}^{k} [\varphi]_j e_j$, where now only the coefficients
$[\varphi]_1, \dots, [\varphi]_k$ are unknown. Thereby, the conditional moment equation (1.2)
reduces to a multivariate linear conditional moment equation, that is,
$\mathbb{E}[Y \mid W] = \sum_{j=1}^{k} [\varphi]_j \, \mathbb{E}[e_j(Z) \mid W]$. Notice that
solving this equation is a classical textbook problem in econometrics (cf. Pagan and Ullah
(1999)). One popular approach is to replace the conditional moment equation by an
unconditional one. Therefore, given k functions $f_1, \dots, f_k$ one may consider k
unconditional moment equations instead of the multivariate conditional moment equation, that
is, $\mathbb{E}[Y f_l(W)] = \sum_{j=1}^{k} [\varphi]_j \, \mathbb{E}[e_j(Z) f_l(W)]$,
$l = 1, \dots, k$. Notice that once the functions $\{f_l\}_{l=1}^{k}$ are chosen, all the
unknown quantities in the unconditional moment equations can be estimated by simply replacing
the theoretical expectation by its empirical counterpart. Moreover, a least squares solution
of the estimated equation leads to a consistent and asymptotically normal estimator of the
parameter vector $([\varphi]_j)_{j=1}^{k}$ under very mild assumptions. The choice of the
functions $\{f_l\}_{l=1}^{k}$ directly influences the asymptotic variance of the estimator and
thus the question of optimal instruments arises (cf. Newey (1990)). Nevertheless, this
approach is very simple and the estimator can be calculated with most statistical software.
However, it has a major defect, since in a vast majority of situations an infinite number of
functions $\{e_j\}_{j \ge 1}$ and associated coefficients $([\varphi]_j)_{j \ge 1}$ is needed
to develop the structural function $\varphi$. The choice of the functions
$\{e_j\}_{j \ge 1}$ now reflects the a priori information (such as smoothness) about the
structural function $\varphi$. However, if we consider also an infinite number of functions
$\{f_l\}_{l \ge 1}$ then for each $k \ge 1$ we could still consider the least squares
estimator described above. Notice that the dimension k plays the role of a smoothing parameter
and we may hope that the estimator of the structural function $\varphi$ is also consistent as
k tends to infinity at a suitable rate. Unfortunately, this is not true in general. Let
$\varphi_k := \sum_{j=1}^{k} [\varphi_k]_j e_j$ denote a least squares solution of the reduced
unconditional moment equations, that is, the vector of coefficients
$([\varphi_k]_j)_{j=1}^{k}$ minimizes the quantity
$\sum_{l=1}^{k} \{ \mathbb{E}[Y f_l(W)] - \sum_{j=1}^{k} \beta_j \, \mathbb{E}[e_j(Z) f_l(W)] \}^2$
over all $(\beta_j)_{j=1}^{k}$. Then, $\varphi_k$ converges to the true structural function as
k tends to infinity only under an additional assumption (defined below) on the basis
$\{f_l\}_{l \ge 1}$. In this paper, we show that in terms of a weighted risk a least squares
estimator $\widehat\varphi_k$ of $\varphi$ based on a dimension reduction together with an
additional thresholding can attain optimal rates of convergence, provided an optimal choice of
the dimension parameter k. It is worth noting that all the results in this paper are obtained
without any additional smoothness assumption on the joint density of (Y, Z, W). In fact we do
not even impose the existence of such a density. Our main contribution is the development of a
method to choose the dimension parameter k in a fully data-driven way, that is, not depending
on characteristics of $\varphi$, and assuming only that the underlying conditional expectation
operator is smoothing in a sense to be made precise below. The central result of the present
paper states that for this automatic choice $\widehat k$, the least squares estimator
$\widehat\varphi_{\widehat k}$ can attain the lower bound up to a constant, and is thus
minimax-optimal. The adaptive choice of k is motivated by the general model selection strategy
developed in Barron et al. (1999). Concretely, following Comte and Taupin (2003),
$\widehat k$ is the minimizer of a penalized contrast. Note that Comte and Taupin (2003)
consider a density deconvolution problem. We illustrate all of our results by considering the
estimation of derivatives of the structural function under a smoothing conditional expectation
operator. Typically, two types of such operators are distinguished in the literature, finitely
or infinitely smoothing. It is interesting to note that Loubes and Marteau (2009) propose an
adaptive estimator for the case where the operator is known to be finitely smoothing. They
derive oracle inequalities and obtain convergence rates which differ from the optimal ones by
a logarithmic factor. We underline that in contrast to this, we provide in this work a unified
estimation procedure which can attain minimax-optimal rates in both cases. In other words, our
estimation procedure attains optimal rates without knowing in advance whether the operator is
finitely or infinitely smoothing.

This article is organized as follows. In the next section, we develop the minimax theory 
for the nonparametric instrumental regression model with respect to the weighted risk. We 
derive, as an illustration, the optimal convergence rates for the estimation of derivatives in 
the finitely and in the infinitely smoothing case. Section 3 is devoted to the construction of 
the adaptive estimator. An upper risk bound is shown and convergence rates for the finitely 
and infinitely smoothing case are found to coincide with minimax optimal ones. All proofs 
are deferred to the appendix. 

2. Minimax optimal estimation 

In this section, we develop a minimax theory for the estimation of the structural function 
and its derivatives in nonparametric instrumental regression models. 

2.1. Basic model assumptions. 

It is convenient to rewrite the moment equation (1.2) in terms of an operator between
Hilbert spaces. Let us first introduce the Hilbert spaces

    $$L^2_Z = \{\varphi : \mathbb{R}^p \to \mathbb{R} \mid \|\varphi\|_Z^2 := \mathbb{E}[\varphi^2(Z)] < \infty\},$$
    $$L^2_W = \{\psi : \mathbb{R}^q \to \mathbb{R} \mid \|\psi\|_W^2 := \mathbb{E}[\psi^2(W)] < \infty\},$$

endowed with the inner products
$\langle \varphi, \tilde\varphi \rangle_Z = \mathbb{E}[\varphi(Z)\tilde\varphi(Z)]$,
$\varphi, \tilde\varphi \in L^2_Z$, and
$\langle \psi, \tilde\psi \rangle_W = \mathbb{E}[\psi(W)\tilde\psi(W)]$,
$\psi, \tilde\psi \in L^2_W$, respectively. Then the conditional expectation of Z given W
defines a linear operator $T\varphi := \mathbb{E}[\varphi(Z) \mid W]$, $\varphi \in L^2_Z$,
which maps $L^2_Z$ into $L^2_W$. In this notation, the moment equation (1.2) can be written as

    $$g := \mathbb{E}[Y \mid W] = \mathbb{E}[\varphi(Z) \mid W] =: T\varphi, \tag{2.1}$$

where the function g belongs to $L^2_W$. Estimation of the structural function $\varphi$ is
thus linked to the inversion of the conditional expectation operator T and it is therefore
called an inverse problem. Moreover, we suppose throughout this paper that the operator T is
compact, which is the case under fairly mild assumptions (cf. Carrasco et al. (2006)).
Consequently, unlike in a multivariate linear instrumental regression model, a continuous
generalized inverse of T does not exist as long as the range of the operator T is an infinite
dimensional subspace of $L^2_W$. This corresponds to the setup of ill-posed inverse problems,
with the additional difficulty that T is unknown and has to be estimated. In what follows, we
always assume that there exists a unique solution $\varphi \in L^2_Z$ of equation (2.1), in
other words, that g belongs to the range of T, and that T is injective. For a detailed
discussion in the context of inverse problems see Chapter 2.1 in Engl et al. (2000), while in
the special case of a nonparametric instrumental regression we refer to Carrasco et al. (2006).

2.2. Complexity: a lower bound 

In this section we show that the obtainable accuracy of any estimator of the structural
function $\varphi$ is essentially determined by additional regularity conditions imposed on
$\varphi$ and the conditional expectation operator T. In this paper, these conditions are
characterized through different weighted norms in $L^2_Z$ with respect to a pre-specified
orthonormal basis $\{e_j\}_{j \ge 1}$ of $L^2_Z$. We formalize these conditions as follows.

Minimal regularity conditions. Given a strictly positive sequence of weights
$w := (w_j)_{j \ge 1}$, we denote by $\|\cdot\|_w$ the weighted norm given by

    $$\|f\|_w^2 := \sum_{j=1}^{\infty} w_j \, |\langle f, e_j \rangle_Z|^2, \qquad \forall f \in L^2_Z.$$

We shall measure the accuracy of any estimator $\widetilde\varphi$ of the unknown structural
function in terms of a weighted risk, that is,
$\mathbb{E}\|\widetilde\varphi - \varphi\|_\omega^2$, for a pre-specified sequence of weights
$\omega := (\omega_j)_{j \ge 1}$. This general approach allows us to consider not only the
estimation of the structural function itself but also of its derivatives, as shown in Section
2.4 below. Moreover, given a sequence of weights $\gamma := (\gamma_j)_{j \ge 1}$ we suppose,
here and subsequently, that for some constant $\rho > 0$ the structural function $\varphi$
belongs to the ellipsoid

    $$\mathcal{F}_\gamma^\rho := \{ f \in L^2_Z : \|f\|_\gamma^2 \le \rho \}. \tag{2.2}$$

The ellipsoid $\mathcal{F}_\gamma^\rho$ captures all the prior information (such as
smoothness) about the unknown structural function $\varphi$. Furthermore, as usual in the
context of ill-posed inverse problems, we specify the mapping properties of the conditional
expectation operator T. Therefore, consider the sequence $(\|Te_j\|_W)_{j \ge 1}$, which
converges to zero since T is compact. In what follows, we impose restrictions on the decay of
this sequence. Denote by $\mathcal{T}$ the set of all injective compact operators mapping
$L^2_Z$ into $L^2_W$. Given a strictly positive sequence of weights
$\lambda := (\lambda_j)_{j \ge 1}$ and $d \ge 1$, we define the subset
$\mathcal{T}_d^\lambda$ of $\mathcal{T}$ by

    $$\mathcal{T}_d^\lambda := \{ T \in \mathcal{T} : \|f\|_\lambda^2 / d \le \|Tf\|_W^2 \le d \, \|f\|_\lambda^2, \ \forall f \in L^2_Z \}. \tag{2.3}$$

Notice that for all $T \in \mathcal{T}_d^\lambda$ it follows that
$d^{-1} \le \|Te_j\|_W^2 / \lambda_j \le d$. Furthermore, let us denote by
$T^* : L^2_W \to L^2_Z$ the adjoint of T which satisfies
$T^*\psi = \mathbb{E}[\psi(W) \mid Z]$ for all $\psi \in L^2_W$. If now
$T \in \mathcal{T}_d^\lambda$ and if $\{e_j\}_{j \ge 1}$ are the eigenfunctions of $T^*T$,
then the sequence $\lambda$ specifies the decay of the eigenvalues of $T^*T$. All results of
this work are derived under regularity conditions on the structural function $\varphi$ and the
conditional expectation operator T described by the sequences $\gamma$ and $\lambda$,
respectively. However, below we provide illustrations of these conditions by assuming a
"regular decay" of these sequences. The next assumption summarizes our minimal regularity
conditions on these sequences.

Assumption A1 Let $\gamma := (\gamma_j)_{j \in \mathbb{N}}$,
$\omega := (\omega_j)_{j \in \mathbb{N}}$ and $\lambda := (\lambda_j)_{j \in \mathbb{N}}$ be
strictly positive sequences of weights with $\gamma_0 = \omega_0 = \lambda_0 = 1$ and
$\Gamma := \sum_{j \in \mathbb{N}} \gamma_j^{-1} < \infty$ such that
$(\omega_n / \gamma_n)_{n \in \mathbb{N}}$ and $(\lambda_n)_{n \in \mathbb{N}}$ are
non-increasing.

It is worth noting that the monotonicity assumption on
$(\omega_n / \gamma_n)_{n \in \mathbb{N}}$ only ensures that $\|\varphi\|_\omega$ is finite,
and hence the weighted risk is a well-defined measure of accuracy for estimators of
$\varphi$. Heuristically, this reflects the fact that we cannot estimate the (s+1)-th
derivative if the structural function has only s derivatives. Moreover, in the illustration
given in Section 2.4, the additional assumption
$\Gamma := \sum_{j \in \mathbb{N}} \gamma_j^{-1} < \infty$ can be interpreted as a continuity
assumption on $\varphi$.
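
The role of the monotonicity of the ratio of the risk weights to the ellipsoid weights can be made explicit in one line. The following worked inequality is only a sketch: it assumes that this ratio is bounded by its initial value 1, which is how we read the normalization in Assumption A1, and it writes $\widetilde\varphi$ for a generic estimator.

```latex
\begin{align*}
\|\varphi\|_\omega^2
  &= \sum_{j} \frac{\omega_j}{\gamma_j}\, \gamma_j\, |\langle \varphi, e_j \rangle_Z|^2
   \le \Big( \sup_{j} \frac{\omega_j}{\gamma_j} \Big) \|\varphi\|_\gamma^2
   \le \rho
   \qquad \text{for all } \varphi \in \mathcal{F}_\gamma^\rho, \\
\mathbb{E}\|\widetilde\varphi - \varphi\|_\omega^2
  &\le 2\, \mathbb{E}\|\widetilde\varphi\|_\omega^2 + 2\rho .
\end{align*}
```

In particular, the weighted risk is finite for every estimator whose weighted norm has a finite second moment.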

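The lower bound. The next assertion provides a lower bound for the weighted risk which
extends the result of Chen and Reiß (2008), who have recently shown a lower bound of the
mean integrated squared error.

Theorem 2.1 Suppose that the i.i.d. (Y, Z, W)-sample of size n obeys the model (1.1a-1.1b),
that the error term U belongs to
$\mathcal{U}_\sigma := \{ U : \mathbb{E}[U \mid W] = 0 \text{ and } \mathbb{E}[U^4 \mid W] \le \sigma^4 \}$,
$\sigma > 0$, and that $\sup_{j \ge 1} \mathbb{E}[e_j^2(Z) \mid W] \le \eta$, $\eta \ge 1$.
Consider sequences $\gamma$, $\omega$ and $\lambda$ satisfying Assumption A1 such that the
conditional expectation operator T associated to (Z, W) belongs to $\mathcal{T}_d^\lambda$,
$d \ge 1$. Define for all $n \ge 1$

    $$k_n^* := k_n^*(\gamma, \lambda, \omega) := \arg\min_{k \in \mathbb{N}} \Big\{ \max\Big( \frac{\omega_k}{\gamma_k}, \sum_{j=1}^{k} \frac{\omega_j}{n \lambda_j} \Big) \Big\} \quad \text{and}$$
    $$R_n^* := R_n^*(\gamma, \lambda, \omega) := \max\Big( \frac{\omega_{k_n^*}}{\gamma_{k_n^*}}, \sum_{j=1}^{k_n^*} \frac{\omega_j}{n \lambda_j} \Big).$$

If in addition
$\kappa := \inf_{n \ge 1} \{ (R_n^*)^{-1} \min( \omega_{k_n^*} \gamma_{k_n^*}^{-1}, \sum_{l=1}^{k_n^*} \omega_l (n \lambda_l)^{-1} ) \} > 0$
and $\sigma^4 \ge 8(3 + 2\rho^2\Gamma^2)$, then for all $n \ge 1$ and for any estimator
$\widetilde\varphi$ of $\varphi$, we have

    $$\sup_{U \in \mathcal{U}_\sigma} \sup_{\varphi \in \mathcal{F}_\gamma^\rho} \mathbb{E}\|\widetilde\varphi - \varphi\|_\omega^2 \ge \frac{\kappa}{4} \min\Big( \rho, \frac{1}{2d} \Big) R_n^*.$$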
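
The dimension and the rate defined in Theorem 2.1 are straightforward to evaluate numerically for given weight sequences. The following is a minimal sketch, not part of the paper: the function name `oracle_dimension` and the search cap `k_max` are hypothetical, and the example weights anticipate the polynomial case of Section 2.4 with illustrative values p = 2, s = 0, a = 1.

```python
import math


def oracle_dimension(n, gamma, lam, omega, k_max=2000):
    """Return (k_star, R_star) minimizing max(bias, variance) over k = 1..k_max.

    gamma, lam, omega are callables j -> weight for j = 1, 2, ...
    k_max is a hypothetical search cap introduced for this sketch.
    """
    best_k, best_val = None, math.inf
    variance_sum = 0.0
    for k in range(1, k_max + 1):
        variance_sum += omega(k) / (n * lam(k))  # sum_{j<=k} omega_j / (n lambda_j)
        bias = omega(k) / gamma(k)               # omega_k / gamma_k
        val = max(bias, variance_sum)
        if val < best_val:
            best_k, best_val = k, val
        # The variance term only grows with k, so no later k can improve.
        if variance_sum > best_val:
            break
    return best_k, best_val


# Illustrative polynomial weights (assumed values, see Section 2.4).
p, s, a = 2.0, 0.0, 1.0
gamma = lambda j: j ** (2 * p)
omega = lambda j: j ** (2 * s)
lam = lambda j: j ** (-2 * a)

for n in (10**3, 10**5):
    k_star, r_star = oracle_dimension(n, gamma, lam, omega)
    # Compare with the polynomial rate n^{-2(p-s)/(2p+2a+1)}.
    print(n, k_star, r_star, n ** (-2 * (p - s) / (2 * p + 2 * a + 1)))
```

The early exit relies on the variance term being increasing in k, which holds for any strictly positive weights; it also avoids evaluating very small values of the operator weights for large k.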

Remark 2.2 The proof of the last assertion is based on Assouad's cube technique (cf.
Korostelev and Tsybakov (1993)), which consists in constructing $2^{k_n^*}$ candidates of
structural functions which have the largest possible $\|\cdot\|_\omega$-distance but are still
statistically indistinguishable. In the last theorem, the additional moment condition
$\sup_{j \ge 1} \mathbb{E}[e_j^2(Z) \mid W] \le \eta$ is obviously satisfied if the basis
functions $\{e_j\}$ are uniformly bounded (e.g. the trigonometric basis considered in Section
2.4). Furthermore, if V denotes a Gaussian random variable with mean zero and variance one,
which is moreover independent of (Z, W), then the additional lower bound on $\sigma^4$ imposed
in Theorem 2.1 ensures that for all structural functions
$\varphi \in \mathcal{F}_\gamma^\rho$, the error term $U := V - \varphi(Z) + [T\varphi](W)$
belongs to $\mathcal{U}_\sigma$. This specific case is only needed to simplify the calculation
of the distance between distributions corresponding to different structural functions (a
similar assumption has been used by Chen and Reiß (2008)). On the other hand, below we derive
an upper bound assuming that the error term U belongs to $\mathcal{U}_\sigma$ and that the
joint distribution of (Z, W) fulfills additional moment conditions. In this situation,
Theorem 2.1 obviously provides a lower bound for any estimator as long as $\sigma$ is
sufficiently large. Note further that this lower bound tends to zero only if
$(\omega_j / \gamma_j)_{j \ge 1}$ is a null sequence. In other words, in case
$\gamma \equiv 1$, uniform consistency over all $\varphi$ such that
$\|\varphi\|_Z^2 \le \rho$ can only be achieved with respect to a weighted norm weaker than
the $L^2_Z$-norm, that is, if $\omega$ is a null sequence. This obviously reflects the
ill-posedness of the underlying inverse problem. Finally, it is important to note that the
regularity conditions imposed on the structural function $\varphi$ and the conditional
expectation operator T involve only the basis $\{e_j\}_{j \ge 1}$ in $L^2_Z$. Therefore, the
lower bound derived in Theorem 2.1 does not capture the influence of the basis
$\{f_l\}_{l \ge 1}$ in $L^2_W$ used to construct the estimator. In other words, the proposed
estimator of $\varphi$ can only attain this lower bound if $\{f_l\}_{l \ge 1}$ is
appropriately chosen. □

2.3. Minimax-optimal Estimation by dimension reduction and thresholding. 

In addition to the basis $\{e_j\}_{j \ge 1}$ of $L^2_Z$ considered in the last section, we
introduce now also a basis $\{f_l\}_{l \ge 1}$ in $L^2_W$. In this section we derive the
asymptotic properties of the least squares estimator under minimal assumptions on these two
bases. More precisely, we suppose that the structural function $\varphi$ belongs to some
ellipsoid $\mathcal{F}_\gamma^\rho$ and that the conditional expectation satisfies a link
condition, i.e., $T \in \mathcal{T}_d^\lambda$. Furthermore, we introduce an additional
condition linked to the basis $\{f_l\}_{l \ge 1}$. Then, under slightly stronger moment
conditions, we show that the proposed estimator attains the lower bound derived in the last
section. All these results are illustrated under classical smoothness assumptions at the end
of this section.

Matrix and operator notations. Given $k \ge 1$, $\mathcal{E}_k$ and $\mathcal{F}_k$ denote
the subspaces of $L^2_Z$ and $L^2_W$ spanned by the functions $\{e_j\}_{j=1}^{k}$ and
$\{f_l\}_{l=1}^{k}$, respectively. $E_k$ and $E_k^\perp$ (resp. $F_k$ and $F_k^\perp$) denote
the orthogonal projections on $\mathcal{E}_k$ (resp. $\mathcal{F}_k$) and its orthogonal
complement $\mathcal{E}_k^\perp$ (resp. $\mathcal{F}_k^\perp$), respectively. Given an
operator (matrix) K, the inverse operator (matrix) of K is denoted by $K^{-1}$, the adjoint
(transposed) operator (matrix) of K by $K^t$. $[\varphi]$, $[\psi]$ and $[K]$ denote the
(infinite) vector and matrix of the function $\varphi \in L^2_Z$, $\psi \in L^2_W$, and the
operator $K : L^2_Z \to L^2_W$ with the entries
$[\varphi]_j = \langle \varphi, e_j \rangle_Z$, $[\psi]_l = \langle \psi, f_l \rangle_W$ and
$[K]_{l,j} = \langle K e_j, f_l \rangle_W$, respectively. The upper k subvector and
$k \times k$ submatrix of $[\varphi]$, $[\psi]$ and $[K]$ are denoted by $[\varphi]_k$,
$[\psi]_k$ and $[K]_k$, respectively. Note that $[K^t]_k = [K]_k^t$. The diagonal matrix with
entries v is denoted by diag(v) and the identity operator (matrix) is denoted by I. Clearly,
$[E_k \varphi]_k = [\varphi]_k$ and if we restrict $F_k K E_k$ to an operator from
$\mathcal{E}_k$ into $\mathcal{F}_k$, then it has the matrix $[K]_k$. Moreover, if
$v \in \mathbb{R}^k$ then $\|v\|$ denotes the Euclidean norm of v and given a
$(k \times k)$ matrix M let $\|M\| := \sup_{\|v\| \le 1} \|Mv\|$ denote its spectral norm and
tr(M) its trace.
Consider the conditional expectation operator T associated to the regressor Z and the
instrument W. If $[e(Z)]$ and $[f(W)]$ denote the infinite random vectors with entries
$e_j(Z)$ and $f_j(W)$ respectively, then $[T]_k = \mathbb{E}[f(W)]_k [e(Z)]_k^t$, which is
throughout the paper assumed to be non singular for all $k \ge 1$ (or, at least for
sufficiently large k), so that $[T]_k^{-1}$ always exists. Note that it is a nontrivial
problem to determine under what precise conditions such an assumption holds (see e.g.
Efromovich and Koltchinskii (2001) and references therein).



Definition of the estimator. Let $(Y_1, Z_1, W_1), \dots, (Y_n, Z_n, W_n)$ be an i.i.d.
sample of (Y, Z, W). Since $[T]_k = \mathbb{E}[f(W)]_k [e(Z)]_k^t$ and
$[g]_k = \mathbb{E}\, Y [f(W)]_k$, we construct estimators by using their empirical
counterparts, that is,

    $$[\widehat T]_k := \frac{1}{n} \sum_{i=1}^{n} [f(W_i)]_k [e(Z_i)]_k^t \quad \text{and} \quad [\widehat g]_k := \frac{1}{n} \sum_{i=1}^{n} Y_i [f(W_i)]_k. \tag{2.5}$$

Then the estimator of the structural function $\varphi$ is defined by

    $$\widehat\varphi_k := \sum_{j=1}^{k} [\widehat\varphi_k]_j e_j \quad \text{with} \quad [\widehat\varphi_k]_k := \begin{cases} [\widehat T]_k^{-1} [\widehat g]_k, & \text{if } [\widehat T]_k \text{ is nonsingular and } \|[\widehat T]_k^{-1}\| \le \sqrt{n}, \\ 0, & \text{otherwise,} \end{cases} \tag{2.6}$$

where the dimension parameter k = k(n) has to tend to infinity as the sample size n
increases. In fact, the estimator $\widehat\varphi_k$ takes its inspiration from the linear
Galerkin approach used in the inverse problem community (cf. Efromovich and Koltchinskii
(2001) or Hoffmann and Reiß (2008)).

Extended link condition. Consistency of this estimator is only possible if the least squares
solution $\varphi_k = \sum_{j=1}^{k} [\varphi_k]_j e_j$ with
$[\varphi_k]_k = [T]_k^{-1} [g]_k$ converges to the structural function $\varphi$ as
$k \to \infty$, which is not true in general. However, the condition
$\sup_{k \in \mathbb{N}} \|[T]_k^{-1} [T E_k^\perp]_k\| < \infty$ is known to be necessary to
ensure convergence of $\varphi_k$. Notice that this condition involves now also the basis
$\{f_l\}_{l \ge 1}$ in $L^2_W$. In what follows we introduce an alternative but stronger
condition to guarantee the convergence, which extends the link condition (2.3), that is,
$T \in \mathcal{T}_d^\lambda$. We denote by $\mathcal{T}_{d,D}^\lambda$ for some $D \ge d$
the subset of $\mathcal{T}_d^\lambda$ given by

    $$\mathcal{T}_{d,D}^\lambda := \Big\{ T \in \mathcal{T}_d^\lambda : \sup_{k \in \mathbb{N}} \big\| [\mathrm{diag}(\lambda)]_k^{1/2} [T]_k^{-1} \big\|^2 \le D \Big\}. \tag{2.7}$$

Remark 2.3 The link condition (2.3) implies the extended link condition (2.7) for a suitable
$D > 0$ if $\{e_j\}$ and $\{f_j\}$ are the eigenfunctions of T and if [T] is only a small
perturbation of $\mathrm{diag}(\lambda^{1/2})$, or if T is strictly positive (for a detailed
discussion we refer to Efromovich and Koltchinskii (2001) and Cardot and Johannes (2010)). We
underline that once both bases $\{e_j\}_{j \ge 1}$ and $\{f_l\}_{l \ge 1}$ are specified,
the extended link condition (2.7) restricts the class of joint distributions of (Z, W) to
those for which the least squares solution $\varphi_k$ is consistent. Moreover, we show below
that under the extended link condition the least squares estimator of $\varphi$ given in
(2.6) can attain minimax-optimal rates of convergence. In this sense, given a joint
distribution of (Z, W), a basis $\{f_l\}_{l \ge 1}$ satisfying the extended link condition
can be interpreted as a set of optimal instruments. Moreover, for each pre-specified basis
$\{e_j\}_{j \ge 1}$, we can theoretically construct a basis $\{f_l\}_{l \ge 1}$ of optimal
instruments such that the extended link condition is not a stronger restriction than the link
condition (2.3) (see Johannes and Breunig (2009) for more details). □

The upper bound. The following theorem provides an upper bound under the extended link
condition (2.7) and an additional moment condition on the bases, more specifically, on the
random vectors $[e(Z)]$ and $[f(W)]$. We begin by formalizing this additional condition.

Assumption A2 There exists $\eta \ge 1$ such that the joint distribution of (Z, W) satisfies

(i) $\sup_{j \in \mathbb{N}} \mathbb{E}[e_j^2(Z) \mid W] \le \eta^2$ and
$\sup_{l \in \mathbb{N}} \mathbb{E}[f_l^4(W)] \le \eta^4$;

(ii) $\sup_{j,l \in \mathbb{N}} \mathrm{Var}(e_j(Z) f_l(W)) \le \eta^4$ and
$\sup_{j,l \in \mathbb{N}} \mathbb{E}\big| e_j(Z) f_l(W) - \mathbb{E}[e_j(Z) f_l(W)] \big|^8 \le 8! \, \eta^6 \, \mathrm{Var}(e_j(Z) f_l(W))$.

It is worth noting that any joint distribution of (Z, W) satisfies Assumption A2 for
sufficiently large $\eta$ if the bases $\{e_j\}_{j \ge 1}$ and $\{f_l\}_{l \ge 1}$ are
uniformly bounded. Here and subsequently, we write $a_n \lesssim b_n$ when there exists a
numerical constant $C > 0$ such that $a_n \le C \, b_n$ for all $n \in \mathbb{N}$, and
$a_n \sim b_n$ when $a_n \lesssim b_n$ and $b_n \lesssim a_n$ simultaneously.
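
The construction of the thresholded least squares estimator in (2.5) and (2.6) can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, both bases default to the trigonometric basis of Section 2.4, and the threshold on the spectral norm of the inverse is checked through the equivalent condition that the smallest squared singular value is at least 1/n, tested by a Cholesky factorization (the boundary case of exact equality is treated as a failure).

```python
import math


def trig_basis(j, s):
    """psi_1 = 1, psi_{2m}(s) = sqrt(2) cos(2 pi m s), psi_{2m+1}(s) = sqrt(2) sin(2 pi m s)."""
    if j == 1:
        return 1.0
    m = j // 2
    trig = math.cos if j % 2 == 0 else math.sin
    return math.sqrt(2.0) * trig(2.0 * math.pi * m * s)


def empirical_moments(sample, k, e=trig_basis, f=trig_basis):
    """Empirical k x k matrix and k-vector of (2.5); sample is a list of (y, z, w)."""
    n = len(sample)
    T = [[0.0] * k for _ in range(k)]
    g = [0.0] * k
    for y, z, w in sample:
        fw = [f(l, w) for l in range(1, k + 1)]
        ez = [e(j, z) for j in range(1, k + 1)]
        for l in range(k):
            g[l] += y * fw[l] / n
            for j in range(k):
                T[l][j] += fw[l] * ez[j] / n
    return T, g


def is_positive_definite(S):
    """Cholesky test for a symmetric matrix S."""
    k = len(S)
    L = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            acc = S[i][j] - sum(L[i][m] * L[j][m] for m in range(j))
            if i == j:
                if acc <= 0.0:
                    return False
                L[i][i] = math.sqrt(acc)
            else:
                L[i][j] = acc / L[j][j]
    return True


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (A nonsingular)."""
    k = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, k):
            factor = M[r][c] / M[c][c]
            M[r] = [x - factor * y for x, y in zip(M[r], M[c])]
    x = [0.0] * k
    for r in reversed(range(k)):
        x[r] = (M[r][k] - sum(M[r][c] * x[c] for c in range(r + 1, k))) / M[r][r]
    return x


def estimate_coefficients(sample, k, e=trig_basis, f=trig_basis):
    """Coefficient vector of the thresholded estimator (2.6)."""
    n = len(sample)
    T, g = empirical_moments(sample, k, e, f)
    # ||T^{-1}|| <= sqrt(n)  iff  sigma_min(T)^2 >= 1/n  iff  T^t T - I/n is
    # positive semidefinite; strict definiteness is tested here.
    S = [[sum(T[r][i] * T[r][j] for r in range(k)) - (i == j) / n
          for j in range(k)] for i in range(k)]
    if not is_positive_definite(S):
        return [0.0] * k
    return solve(T, g)


# Toy usage: noiseless data with Z = W on a grid and phi equal to the second basis function.
n = 200
sample = [(trig_basis(2, i / n), i / n, i / n) for i in range(n)]
print(estimate_coefficients(sample, k=3))
```

In the toy usage the instrument coincides with the regressor, so the empirical matrix is numerically the identity and the second coefficient is recovered; with a constant instrument the empirical matrix is singular and the threshold sets all coefficients to zero.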

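Theorem 2.4 Suppose that the i.i.d. (Y, Z, W)-sample of size n obeys the model (1.1a-1.1b)
and that the joint distribution of (Z, W) fulfills Assumption A2 for some $\eta \ge 1$.
Consider sequences $\gamma$, $\omega$ and $\lambda$ satisfying Assumption A1 such that the
conditional expectation operator T associated to (Z, W) belongs to
$\mathcal{T}_{d,D}^\lambda$, $d, D \ge 1$. Let $R_n^*$ and $\kappa$ be as given in Theorem
2.1. If in addition $\sup_{k \in \mathbb{N}} k^3 / \gamma_k =: \zeta < \infty$, then we have
for all $n \in \mathbb{N}$ with $(k_n^*)^3 \ge 4D\zeta/\kappa$ that

    $$\sup_{U \in \mathcal{U}_\sigma} \sup_{\varphi \in \mathcal{F}_\gamma^\rho} \mathbb{E}\|\widehat\varphi_{k_n^*} - \varphi\|_\omega^2 \lesssim D \, \eta^4 (\sigma^2 + \Gamma D d \rho) \, R_n^* \cdot \Big\{ \frac{d\zeta}{\kappa} + \max\Big( 1, \frac{\lambda_{k_n^*}}{\omega_{k_n^*}} \max_{1 \le j \le k_n^*} \frac{\omega_j}{\lambda_j} \Big) + (k_n^*)^3 \Big| P\Big( \|[\widehat T]_{k_n^*} - [T]_{k_n^*}\|^2 > \frac{\lambda_{k_n^*}}{4D} \Big) \Big|^{1/4} \Big\} + \rho \, P\Big( \|[\widehat T]_{k_n^*} - [T]_{k_n^*}\|^2 > \frac{\lambda_{k_n^*}}{4D} \Big).$$

Remark 2.5 We emphasize that the bound in the last theorem is not asymptotic. Moreover,
it is worth noting that the term
$\max( 1, (\lambda_{k_n^*} / \omega_{k_n^*}) \max_{1 \le j \le k_n^*} \omega_j / \lambda_j )$
is uniformly bounded by a constant if $\omega / \lambda$ is non-decreasing, which we suppose
from now on. However, this is not the case in general. □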

A comparison with the lower bound from Theorem 2.1 shows that the last assertion does not
establish the minimax-optimality of the estimator. However, the upper bound in Theorem
2.4 can be improved by imposing a moment condition stronger than Assumption A2. To
be more precise, consider the centered random variable
$e_j(Z) f_l(W) - \mathbb{E}[e_j(Z) f_l(W)]$. Then Assumption A2 (ii) states that its 8th
moment is uniformly bounded over $j, l \in \mathbb{N}$. In the next assumption we suppose
that these random variables satisfy Cramér's condition uniformly, which is known to be
sufficient to obtain an exponential bound for their large deviations (cf. Bosq (1998)).

Assumption A3 There exists $\eta \ge 1$ such that the joint distribution of (Z, W) satisfies
Assumption A2 and in addition

(iii) $\sup_{j,l \in \mathbb{N}} \mathbb{E}\big| e_j(Z) f_l(W) - \mathbb{E}[e_j(Z) f_l(W)] \big|^k \le \eta^{k-2} \, k! \, \mathrm{Var}(e_j(Z) f_l(W))$, $k = 3, 4, \dots$.

It is well known that Cramér's condition is fulfilled in particular if the random variable
$e_j(Z) f_l(W) - \mathbb{E}[e_j(Z) f_l(W)]$ is bounded. Whenever the bases
$\{e_j\}_{j \ge 1}$ and $\{f_l\}_{l \ge 1}$ are uniformly bounded it follows thus again that
any joint distribution of (Z, W) satisfies Assumption A3 for sufficiently large $\eta$. On
the other hand, we show that under this additional condition the deviation probability tends
to zero faster than $R_n^*$. Hence, the rate $R_n^*$ is optimal and
$\widehat\varphi_{k_n^*}$ is minimax-optimal, which is summarized in the next assertion.
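
The claim that bounded variables satisfy Cramér's condition follows from a one-line moment comparison. In the sketch below, X stands for a centered random variable such as the centered product of two basis functions, and c is a hypothetical almost sure bound on its absolute value (our notation, not the paper's).

```latex
\begin{align*}
|X| \le c \ \text{a.s.}, \quad \mathbb{E}X = 0
\quad \Longrightarrow \quad
\mathbb{E}|X|^k
  \le c^{\,k-2} \, \mathbb{E}X^2
  = c^{\,k-2} \, \mathrm{Var}(X)
  \le c^{\,k-2} \, k! \, \mathrm{Var}(X),
\qquad k = 3, 4, \dots
\end{align*}
```

Hence condition (iii) holds with any constant not smaller than max(1, c). For two bases bounded in absolute value by $\sqrt{2}$, the product is bounded by 2 and its centered version by 4, so c = 4 is one admissible choice.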

Theorem 2.6 Suppose that the assumptions of Theorem 2.4 are satisfied. In addition,
assume that the joint distribution of (Z, W) fulfills Assumption A3 and that the sequence
$(\omega / \lambda)$ is non-decreasing. For all $n \in \mathbb{N}$ with
$(\log k_n^*) / k_n^* \le \kappa / (280 D \eta^2 \zeta)$ and
$(\log R_n^*) / k_n^* \ge -\kappa / (40 D \eta^2 \zeta)$ we have

    $$\sup_{U \in \mathcal{U}_\sigma} \sup_{\varphi \in \mathcal{F}_\gamma^\rho} \mathbb{E}\|\widehat\varphi_{k_n^*} - \varphi\|_\omega^2 \lesssim d \, D \, \eta^4 \, \zeta \, \kappa^{-1} (\sigma^2 + \Gamma D d \rho) \, R_n^*.$$

Remark 2.7 From Theorems 2.1 and 2.6 it follows that the estimator
$\widehat\varphi_{k_n^*}$ attains the optimal rate $R_n^*$ for all sequences $\gamma$,
$\omega$ and $\lambda$ satisfying the minimal regularity conditions from Assumption A1. Let
us elaborate on the interesting role of the sequences $\gamma$, $\omega$ and $\lambda$.
Theorems 2.1 and 2.6 show that the faster the sequence $\lambda$ decreases, the slower the
obtainable optimal rate of convergence becomes. On the other hand, a faster increase of
$\gamma$ or decrease of $\omega$ leads to a faster optimal rate. In other words, as expected,
a structural function satisfying a stronger regularity condition can be estimated faster, and
measuring the accuracy with respect to a weaker norm leads to faster rates, too. □

2.4. Illustration: estimation of derivatives. 

To illustrate the previous results, we will describe in this section the prior information
about the unknown structural function $\varphi$ by its level of smoothness. In order to
simplify the presentation, we follow Hall and Horowitz (2005) (where a more detailed
discussion of this assumption can be found) and suppose that the marginal distributions of
the scalar regressor Z and the scalar instrument W are uniform on the interval [0, 1].
It is worth noting that all the results below can be extended to the multivariate case in a
straightforward way. In the univariate case, it follows that both Hilbert spaces $L^2_Z$ and
$L^2_W$ are isomorphic to $L^2[0,1]$, endowed with the usual norm $\|\cdot\|$ and inner
product $\langle \cdot, \cdot \rangle$.
In the last sections, we have seen that the choice of the basis $\{e_j\}_{j \ge 1}$ is
directly linked to the a priori assumptions we are willing to impose on the structural
function. In case of classical smoothness assumptions, it is natural to consider the Sobolev
space of periodic functions $\mathcal{W}_r$, $r \ge 0$, which for an integer r is given by

    $$\mathcal{W}_r = \{ f \in H_r : f^{(j)}(0) = f^{(j)}(1), \ j = 0, 1, \dots, r-1 \},$$

where
$H_r := \{ f \in L^2[0,1] : f^{(r-1)} \text{ absolutely continuous}, \ f^{(r)} \in L^2[0,1] \}$
is a Sobolev space. Moreover, let us introduce the trigonometric basis

    $$\psi_1 :\equiv 1, \quad \psi_{2j}(s) := \sqrt{2} \cos(2\pi j s), \quad \psi_{2j+1}(s) := \sqrt{2} \sin(2\pi j s), \quad s \in [0,1], \ j \in \mathbb{N}.$$

It is well-known that the union (JneN^S °^ ellipsoids J 7 ™ in L 2 [0, 1] defined by using the 
trigonometric basis {ej = tpj} and the weight sequence w\ = 1, Wj = j 2r , j ^ 2 in def- 
inition (2.2) coincides with the Sobolev space of periodic functions W r (c.f. Neubauer 
(1988a,b)). Therefore, let us denote by W£ := J 7 ^, c > an ellipsoid in the Sobolev space 
W r - In the remainder of this section we will suppose that the prior information about the 
unknown structural function tp is characterized by the Sobolev ellipsoid Wp , p > 0, i.e., that 
<p is p ^ times differentiable. In this illustration, we consider the estimation of derivatives 
of the structural function ip. Therefore, it is interesting to recall that, up to a constant, 
for any function h G Wp the weighted norm \\h\\ u with ujq = 1 and ujj = j 2s , j ^ 2, equals 
the L 2 -norm of the s-th weak derivative for each integer ^ s ^ p. By virtue of this 
relation, the results in the previous section imply also a lower as well as an upper bound 
of the L 2 -risk for the estimation of the s-th weak derivative of (p. Finally, we restrict our 
attention to conditional expectation operator T G with either 

[p-A] a polynomially decreasing sequence A, i.e., Ao = 1 and Xj = j~ 2a , j ^ 2, for some 
a > 0, or 

[e-A] an exponentially decreasing sequence A, i.e., Ao = 1 and Xj = exp(— j 2a ), j ^ 2, for 
some a > 0. 
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It is easily seen that the minimal regularity conditions given in Assumption Al are satisfied 
if p > 1/2. Roughly speaking, this means that the structural function is at least continuous. 
The lower bound presented in the next assertion follows now directly from Theorem 2.1. 
Note that the additional condition, sup^ E[ej(Z)|W] ^ 77, 77 ^ 8, is satisfied since the 
trigonometric basis is bounded uniformly by two. 
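As a minimal numerical sketch of the objects introduced above, the following Python snippet evaluates the trigonometric basis, the Sobolev weights and the two eigenvalue decay regimes [p-λ] and [e-λ]. The function names (`psi`, `sobolev_weight`, `lam_poly`, `lam_exp`, `inner`) are illustrative assumptions and not notation from the paper; the midpoint rule in `inner` is only a convenient way to check orthonormality numerically.

```python
import math


def psi(j, s):
    """Trigonometric basis on [0, 1], indexed from j = 1 (psi_1 = 1)."""
    if j == 1:
        return 1.0
    freq = j // 2  # psi_{2m} and psi_{2m+1} both carry frequency m
    trig = math.cos if j % 2 == 0 else math.sin
    return math.sqrt(2.0) * trig(2.0 * math.pi * freq * s)


def sobolev_weight(j, r):
    """Weight sequence w_1 = 1, w_j = j^(2r) for j >= 2."""
    return 1.0 if j == 1 else float(j) ** (2.0 * r)


def lam_poly(j, a):
    """Polynomially decreasing case: lambda_1 = 1, lambda_j = j^(-2a)."""
    return 1.0 if j == 1 else float(j) ** (-2.0 * a)


def lam_exp(j, a):
    """Exponentially decreasing case: lambda_1 = 1, lambda_j = exp(-j^(2a))."""
    return 1.0 if j == 1 else math.exp(-(float(j) ** (2.0 * a)))


def inner(i, j, m=512):
    """Midpoint-rule approximation of the L^2[0, 1] inner product <psi_i, psi_j>."""
    return sum(psi(i, (t + 0.5) / m) * psi(j, (t + 0.5) / m) for t in range(m)) / m
```

Evaluating `inner(i, j)` for small indices should return values close to 1 on the diagonal and close to 0 off the diagonal, which is the orthonormality used throughout this section.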

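Proposition 2.8 Suppose that an i.i.d. sample of size $n$ is drawn from the model (1.1a-1.1b). If $\varphi \in W_p^\rho$,
$p > 1/2$, then we have for any estimator $\tilde\varphi^{(s)}$ of $\varphi^{(s)}$, $0 \leq s < p$,

[p-λ] in the polynomially decreasing case that

$$\sup_{U \in \mathcal{U}_\sigma} \sup_{\varphi \in W_p^\rho} \{ E\|\tilde\varphi^{(s)} - \varphi^{(s)}\|^2 \} \gtrsim n^{-2(p-s)/(2p+2a+1)},$$

[e-λ] in the exponentially decreasing case that

$$\sup_{U \in \mathcal{U}_\sigma} \sup_{\varphi \in W_p^\rho} \{ E\|\tilde\varphi^{(s)} - \varphi^{(s)}\|^2 \} \gtrsim (\log n)^{-(p-s)/a}.$$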

In this section, the basis of $L^2_W$ is also given by the trigonometric basis $\{f_l = \psi_l\}$. In this
situation, the additional moment conditions formalized in Assumption A3 are automatically
fulfilled since both bases $\{e_j\}_{j \geq 1}$ and $\{f_l\}_{l \geq 1}$ are uniformly bounded. We suppose that the
associated conditional expectation operator $T$ satisfies the extended link condition (2.7),
that is, $T \in \mathcal{T}_{d,D}^\lambda$. Thereby, we restrict the set of possible joint distributions of $(Z,W)$ to
those having the trigonometric basis as optimal instruments. As an estimator of $\varphi^{(s)}$, we
shall consider the $s$-th weak derivative of the estimator $\hat\varphi_k$ defined in (2.6). Recall that for
each integer $0 \leq s \leq p$, the $s$-th weak derivative of the estimator $\hat\varphi_k$ is

$$\hat\varphi_k^{(s)}(t) = \sum_{j \in \mathbb{Z}} (2 i \pi j)^s \Big( \int_0^1 \hat\varphi_k(u) \exp(-2 i \pi j u)\, du \Big) \exp(2 i \pi j t).$$
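The Fourier-multiplier form of the $s$-th weak derivative can be illustrated with a short Python sketch. It assumes that a function is given through finitely many complex Fourier coefficients $c_j$ with $f(t) = \sum_j c_j \exp(2 i \pi j t)$; the helper name `fourier_derivative` and the dictionary representation are hypothetical choices made only for this example.

```python
import cmath
import math


def fourier_derivative(coeffs, s, t):
    """Evaluate the s-th derivative at t of f(t) = sum_j c_j exp(2*pi*i*j*t).

    coeffs maps integer frequencies j to complex Fourier coefficients c_j.
    """
    total = 0j
    for freq, c in coeffs.items():
        if freq == 0:
            # The constant term survives only when no derivative is taken.
            multiplier = 1.0 if s == 0 else 0.0
        else:
            multiplier = (2j * math.pi * freq) ** s  # the factor (2*i*pi*j)^s
        total += multiplier * c * cmath.exp(2j * math.pi * freq * t)
    return total.real  # real part; exact for real-valued f


# sin(2*pi*t) has coefficients c_1 = -i/2 and c_{-1} = i/2.
sine = {1: -0.5j, -1: 0.5j}
first_derivative_at_zero = fourier_derivative(sine, 1, 0.0)  # equals 2*pi up to rounding
```

Differentiation acts diagonally on the Fourier coefficients, which is why the weighted norm with $\omega_j = j^{2s}$ measures the $L^2$-norm of the $s$-th derivative up to a constant.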

Applying Theorem 2.4, the rates of the lower bound given in the last assertion provide, up
to a constant, also an upper bound of the $L^2$-risk of the estimator $\hat\varphi_{k_n^*}^{(s)}$, which is summarized
in the next proposition. We have thus proved that these rates are optimal and the proposed
estimator $\hat\varphi_{k_n^*}^{(s)}$ is minimax optimal in both cases.

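Proposition 2.9 Suppose that the i.i.d. $(Y,Z,W)$-sample of size $n$ obeys the model (1.1a-1.1b).
Let $\varphi \in W_p^\rho$, $p \geq 3/2$. For $0 \leq s < p$ consider the estimator $\hat\varphi_{k_n^*}^{(s)}$ given in (2.6).

[p-λ] In the polynomially decreasing case with dimension parameter $k_n^* \sim n^{1/(2p+2a+1)}$,

$$\sup_{U \in \mathcal{U}_\sigma} \sup_{\varphi \in W_p^\rho} \{ E\|\hat\varphi_{k_n^*}^{(s)} - \varphi^{(s)}\|^2 \} \lesssim n^{-2(p-s)/(2p+2a+1)}.$$

[e-λ] In the exponentially decreasing case with $k_n^* \sim (\log n)^{1/(2a)}$,

$$\sup_{U \in \mathcal{U}_\sigma} \sup_{\varphi \in W_p^\rho} \{ E\|\hat\varphi_{k_n^*}^{(s)} - \varphi^{(s)}\|^2 \} \lesssim (\log n)^{-(p-s)/a}.$$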

Remark 2.10 We emphasize the interesting role of the parameters $p$ and $a$ characterizing
the regularity conditions imposed on $\varphi$ and $T$ respectively: As we see from Propositions 2.8
and 2.9, if the value of $a$ increases, the obtainable optimal rate of convergence becomes slower.
Therefore, the parameter $a$ is often called degree of ill-posedness (c.f. Natterer (1984)). On
the other hand, an increase of the quantity $p$ leads to a faster optimal rate. In other words,
as expected, a smoother structural function can be estimated faster. Finally, as opposed to
the polynomial case, in the exponential case the smoothing parameter $k_n^*$ does not depend
on the value of $p$. It follows that the proposed estimator is automatically adaptive, i.e.,
it does not depend on an a-priori knowledge of the degree of smoothness of the structural
function $\varphi$. However, the choice of the smoothing parameter depends on the properties of
$T$, more precisely, the value of $a$. □
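The dependence of the rates on $p$ and $a$ described in this remark can be checked numerically. The sketch below simply evaluates the rate expressions and dimension parameters stated in Propositions 2.8 and 2.9, ignoring all constants; the function names are illustrative assumptions.

```python
import math


def rate_polynomial(n, p, a, s=0):
    """Rate n^(-2(p-s)/(2p+2a+1)) of the polynomially decreasing case, constants ignored."""
    return n ** (-2.0 * (p - s) / (2.0 * p + 2.0 * a + 1.0))


def rate_exponential(n, p, a, s=0):
    """Rate (log n)^(-(p-s)/a) of the exponentially decreasing case, constants ignored."""
    return math.log(n) ** (-(p - s) / a)


def k_star_polynomial(n, p, a):
    """Dimension parameter of order n^(1/(2p+2a+1)); it depends on p."""
    return n ** (1.0 / (2.0 * p + 2.0 * a + 1.0))


def k_star_exponential(n, a):
    """Dimension parameter of order (log n)^(1/(2a)); it does not involve p."""
    return math.log(n) ** (1.0 / (2.0 * a))
```

Note that `k_star_exponential` takes no smoothness argument at all, which mirrors the observation that in the exponential case the choice of the dimension parameter only requires knowledge of $a$.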



3. Adaptive estimation under smoothness assumptions 

In this section, our objective is to construct a fully adaptive estimator of the structural
function $\varphi$. Adaptation means that in spite of the conditional expectation operator $T$ being
unknown, the estimator should attain the optimal rate of convergence over the ellipsoid
$\mathcal{F}_\gamma^\rho$ for a wide range of different weight sequences $\gamma$. However, we will suppose that the
operator $T$ is diagonal with respect to the trigonometric basis $\{\psi_j\}$. In this situation,
for example, an operator with polynomially decreasing $\lambda$ having a degree of ill-posedness
$a$ behaves like $a$-times integrating, and hence it is also called finitely smoothing. On the
other hand, when the sequence $\lambda$ is exponentially decreasing with degree of ill-posedness
$a$, the operator behaves like integrating infinitely many times, and hence it is also called
infinitely smoothing. Thus, this additional condition imposes in fact a smoothing condition
on the unknown conditional expectation operator $T$. Even if we assume that the operator
is smoothing, we do not impose any a-priori knowledge about the specific decay of $\lambda$.
Our starting point is the estimator given in (2.6), which in this situation is of the form

$$\hat\varphi_k := \sum_{j=1}^{k} \frac{[\hat g]_j}{[\hat T]_{jj}}\, 1\{[\hat T]_{jj}^2 \geq 1/n\}\, \psi_j \qquad (3.1)$$

with $[\hat g]_j$ and $[\hat T]_{jj}$ defined in (2.5). In the last section, we have shown that this estimator
is minimax-optimal provided the dimension parameter $k$ is chosen in an optimal way. In
what follows, the dimension parameter $k$ is chosen using a model selection approach via
penalization. This choice will only involve the data and none of the sequences $\gamma$ and $\lambda$
describing the underlying smoothness. First, we introduce some sequences which are used
below.

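Definition 3.1

(i) For all $k \geq 1$, define $\Delta_k := \max_{1 \leq j \leq k} \omega_j/\lambda_j$, $\tau_k := \max_{1 \leq j \leq k} (\omega_j)_{\vee 1}/\lambda_j$ with $(q)_{\vee 1} :=
\max(q, 1)$ and

$$\delta_k := k\, \Delta_k\, \frac{\log(\tau_k \vee (k+2))}{\log(k+2)}.$$

Let further $\Sigma$ be a non-decreasing function such that for all $C > 0$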



, fclogfa V(fc + 2)) ^ , . 
r k exp ( - 6ClQg(k + 2) ) < < 00 (3.2) 



and sup neN exp ( - K 2 C" 1 n 1 / 6 + | logn) < E(C) with K 2 = (y/2 - l)/(21>/2). 
(ii) Define a sequence $N$ as follows,

iV n := iV n (A, d) := max j 1 < JV < n n 7 exp ^ - ^ (— J 

and 5jv/« ^ 1 f 
It is easy to see that there always exists a function $\Sigma$ satisfying condition (3.2). Consider
the estimator $\hat\varphi_{\tilde k}$ defined by choosing the dimension parameter $\tilde k$ such that

$$\tilde k := \arg\min_{1 \leq k \leq N_n} \Big\{ -\|\hat\varphi_k\|_\omega^2 + c\, \frac{\delta_k}{n} \Big\}$$

for some constant $c > 0$. It is shown in Johannes and Schwarz (2010) and Comte and
Johannes (2010) that such an estimator can attain minimax-optimal rates in the context
of a deconvolution problem and a functional linear model respectively. However, this
estimator is only partially adaptive, since the dimension parameter is chosen using a criterion
function that involves the sequences $N$ and $\delta$ which depend on $\lambda$ and $d$. We circumvent this
problem by defining empirical versions of these sequences. The fully adaptive estimator is
then defined analogously to the one above, but uses the estimated rather than the original
sequences.

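Definition 3.2 Let $\hat\delta := (\hat\delta_k)_{k \geq 1}$, $\hat N := (\hat N_n)_{n \geq 1}$ be as follows.

(i) Given $\hat\Delta_k := \max_{1 \leq j \leq k} \omega_j [\hat T]_{jj}^{-2}\, 1\{\inf_{1 \leq j \leq k} [\hat T]_{jj}^2 \geq 1/n\}$ and
$\hat\tau_k := \max_{1 \leq j \leq k} (\omega_j)_{\vee 1} [\hat T]_{jj}^{-2}\, 1\{\inf_{1 \leq j \leq k} [\hat T]_{jj}^2 \geq 1/n\}$ let

$$\hat\delta_k := k\, \hat\Delta_k\, \frac{\log(\hat\tau_k \vee (k+2))}{\log(k+2)}.$$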



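(ii) Given $N_n^u := \arg\max_{1 \leq N \leq n} \{ \max_{1 \leq j \leq N} \omega_j/n \leq 1 \}$, let

$$\hat N_n := \arg\min_{1 \leq j \leq N_n^u} \Big\{ \frac{|[\hat T]_{jj}|^2}{j (\omega_j)_{\vee 1}} < \frac{\log n}{n} \Big\}.$$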



It is worth stressing that none of these sequences involves any a-priori knowledge about
either the target function $\varphi$ or the operator $T$. Now, we choose the dimension parameter
as

$$\hat k := \arg\min_{1 \leq k \leq \hat N_n} \Big\{ -\|\hat\varphi_k\|_\omega^2 + 540\, E[Y^2]\, \frac{\hat\delta_k}{n} \Big\}. \qquad (3.3)$$

Throughout the paper we do not address the issue that the value $E[Y^2]$ is not known in
practice. However, it can easily be estimated by its empirical counterpart. Moreover, the
constant 540, though suitable for the theory, can probably be chosen much smaller in
practice, for instance by means of a simulation study (cf. Comte et al. (2006) in the context
of a deconvolution problem).
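The data-driven choice of the dimension parameter can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the empirical coefficients are taken to be $[\hat g]_j = n^{-1}\sum_i Y_i \psi_j(W_i)$ and $[\hat T]_{jj} = n^{-1}\sum_i \psi_j(Z_i)\psi_j(W_i)$, $E[Y^2]$ is replaced by its empirical counterpart, and the upper bound $\hat N_n$ of the search range is replaced by a simpler stand-in (the search stops at the first index whose squared diagonal entry falls below $1/n$, capped by a user-supplied `k_cap`). The names `select_dimension`, `simulate`, `k_cap` and `pen_const` are hypothetical.

```python
import math
import random


def psi(j, x):
    """Trigonometric basis on [0, 1]: psi_1 = 1, psi_{2m} cosine, psi_{2m+1} sine."""
    if j == 1:
        return 1.0
    trig = math.cos if j % 2 == 0 else math.sin
    return math.sqrt(2.0) * trig(2.0 * math.pi * (j // 2) * x)


def select_dimension(Y, Z, W, s=0.0, k_cap=20, pen_const=540.0):
    """Return (k_hat, coefficients of the selected estimator, criterion values)."""
    n = len(Y)
    ey2 = sum(y * y for y in Y) / n        # empirical counterpart of E[Y^2]
    coefs, crit = [], []
    norm_sq = 0.0                          # running value of ||phi_hat_k||_omega^2
    big_delta, tau = 0.0, 0.0              # running maxima: Delta_hat_k and tau_hat_k
    for k in range(1, k_cap + 1):
        g_hat = sum(y * psi(k, w) for y, w in zip(Y, W)) / n
        t_hat = sum(psi(k, z) * psi(k, w) for z, w in zip(Z, W)) / n
        if t_hat * t_hat < 1.0 / n:
            break                          # stand-in for the bound N_hat_n (assumption)
        omega = 1.0 if k == 1 else float(k) ** (2.0 * s)
        coef = g_hat / t_hat
        coefs.append(coef)
        norm_sq += omega * coef * coef
        big_delta = max(big_delta, omega / t_hat ** 2)
        tau = max(tau, max(omega, 1.0) / t_hat ** 2)
        delta_hat = k * big_delta * math.log(max(tau, k + 2.0)) / math.log(k + 2.0)
        crit.append(-norm_sq + pen_const * ey2 * delta_hat / n)
    if not crit:
        return 0, [], []
    k_hat = min(range(len(crit)), key=crit.__getitem__) + 1
    return k_hat, coefs[:k_hat], crit


def simulate(n, seed=0):
    """Toy data: Z is a circular perturbation of W, so T is diagonal in the basis."""
    rng = random.Random(seed)
    W = [rng.random() for _ in range(n)]
    Z = [(w + rng.uniform(-0.1, 0.1)) % 1.0 for w in W]
    Y = [math.cos(2.0 * math.pi * z) + rng.gauss(0.0, 0.1) for z in Z]
    return Y, Z, W
```

A call such as `select_dimension(*simulate(400), pen_const=0.01)` returns the selected dimension together with the coefficients of the selected estimator. Lowering `pen_const` below the theoretical value 540 follows the suggestion in the text that a much smaller constant may be adequate in practice; the value 0.01 is an arbitrary choice for the toy example.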

Our main result below needs the following assumption.

Assumption A4 The sequence $N$ from Definition 3.1 (ii) satisfies the conditions

$$\max_{j > N_n} \frac{\lambda_j}{j (\omega_j)_{\vee 1}} \leq \frac{\log n}{4dn} \quad \text{and} \quad d^{-1} \min_{1 \leq j \leq N_n} \lambda_j \geq \frac{2}{n}.$$

By construction, these conditions are satisfied for sufficiently large $n$. However, let us
illustrate them by the particular examples introduced in section 2.4.



Remark 3.3 Recall the distinction between finitely and infinitely smoothing conditional
expectation operators discussed in section 2.4. The sequences from Definition 3.1 take the
following values in either of the two cases.

[fs] In the finitely smoothing case, we have

$$\Delta_k = k^{2a+2s}, \quad \delta_k \sim k^{2a+2s+1}, \quad N_n \sim n^{1/(2a+2s+1)}.$$

[is] In the infinitely smoothing case, we have

$$\Delta_k = k^{2s} \exp(k^{2a}), \quad \delta_k \sim k^{2a+2s+1} \exp(k^{2a}) (\log k)^{-1}, \quad N_n \sim \Big\{ \log\Big( \frac{n \log\log n}{(\log n)^{(2a+2s+1)/(2a)}} \Big) \Big\}^{1/(2a)}.$$

It is easily verified that the sequence $N$ satisfies Assumption A4 in either case. □
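The orders stated in this remark can be checked numerically from the definitions of $\Delta_k$, $\tau_k$ and $\delta_k$ in Definition 3.1 (i). The following sketch is an illustration only; `delta_sequences` is a hypothetical helper name, and the sequences $\omega$ and $\lambda$ are passed as plain Python callables.

```python
import math


def delta_sequences(omega, lam, k_max):
    """Return a list of (Delta_k, tau_k, delta_k) for k = 1, ..., k_max.

    omega and lam are callables giving omega_j and lambda_j for j >= 1.
    """
    out = []
    big_delta, tau = 0.0, 0.0
    for k in range(1, k_max + 1):
        big_delta = max(big_delta, omega(k) / lam(k))        # max_{j<=k} omega_j / lambda_j
        tau = max(tau, max(omega(k), 1.0) / lam(k))          # same with (omega_j) v 1
        delta = k * big_delta * math.log(max(tau, k + 2.0)) / math.log(k + 2.0)
        out.append((big_delta, tau, delta))
    return out


# Finitely smoothing case with a = 1 and s = 1: omega_j = j^2, lambda_j = j^(-2).
a, s = 1.0, 1.0
seqs = delta_sequences(lambda j: float(j) ** (2 * s), lambda j: float(j) ** (-2 * a), 100)
ratios = [d / k ** (2 * a + 2 * s + 1) for k, (_, _, d) in enumerate(seqs, start=1)]
```

In this example the entries of `ratios` stay between 1 and $2a+2s = 4$, in line with $\delta_k \sim k^{2a+2s+1}$ in the finitely smoothing case.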

We are now able to state the main result of this paper providing an upper risk bound for 
the fully adaptive estimator. 

Theorem 3.4 Assume an $n$-sample of $(Y, Z, W)$. Consider sequences $\omega$, $\gamma$, and $\lambda$ satisfying
Assumption A1 such that the conditional expectation operator $T$ associated to $(Z, W)$ belongs
to $\mathcal{T}_{d,D}^\lambda$, $d, D \geq 1$, and is diagonal with respect to $\{\psi_j\}$. Let the sequences $\delta$ and
$N$ be as in Definition 3.1 and suppose that Assumption A4 holds. Define further
$N_n^l := \arg\max_{1 \leq N \leq N_n} \big\{ \frac{\lambda_N}{N (\omega_N)_{\vee 1}} \geq \frac{4d \log n}{n} \big\}$. Consider the estimator $\hat\varphi_{\hat k}$ defined in (3.1) with $\hat k$ given
by (3.3). Then for all $n \geq 1$

sup sup {E||^ - ip\\l } < {2pT + a 2 + ifdQ 

Uj . ( 1 \\ 1 f v ( {2pT + a 2 )Cd + Vu\ z 





( U k 




min < 


max — 






I \lk 


n J 



+ pmax <^ min 1, — \\ + - \ E — 2 + 1 

i>! l7i V "V J n I V V u\z ) 

where $V_{U|Z} := E[\mathrm{Var}(U \mid Z)]$ and $\zeta_d := (\log 3d)/\log 3$.

Compare the last assertion with the lower bound given in Theorem 2.1. It is easily seen that
if $(\omega/\lambda)$ is non-decreasing, the second term in the upper bound of Theorem 3.4 is always
smaller than the first one. Thus, in this situation the fully adaptive estimator attains the
lower bound up to a constant as long as $\sup_{k \geq 1} \{ \delta_k / (\sum_{1 \leq j \leq k} \omega_j/\lambda_j) \} < \infty$ and if the optimal
dimension parameter $k_n^*$ given in Theorem 2.1 is smaller than $N_n$, which is summarized in
the next assertion.

Corollary 3.5 Let the assumptions of Theorem 3.4 be satisfied. If in addition $(\omega/\lambda)$ is
non-decreasing, $\sup_{k \geq 1} \{ \delta_k / (\sum_{1 \leq j \leq k} \omega_j/\lambda_j) \} < \infty$ and $\sup_{n \in \mathbb{N}} (k_n^*/N_n^l) \leq 1$, then

$$\sup_{U \in \mathcal{U}_\sigma} \sup_{\varphi \in \mathcal{F}_\gamma^\rho} \{ E\|\hat\varphi_{\hat k} - \varphi\|_\omega^2 \} = O(R_n^*), \quad \text{as } n \to \infty,$$

where $k_n^*$ and $R_n^*$ are given in (2.4).

It is worth noting that the additional assumptions in the last assertion are sufficient to
establish the order optimality of the estimator, but not necessary, as is shown in the example [is]
below.

3.1. Illustration: estimation of derivatives (continued)

The following result shows that even without any prior knowledge on the structural function
$\varphi$ and for all smoothing operators $T$, the fully adaptive penalized estimator automatically
attains the optimal rate in the finitely and in the infinitely smoothing case. Recall that the
computation of the dimension parameter $\hat k$ given in (3.3) involves the sequence $(N_n^u)_{n \geq 1}$,
which in our illustration satisfies $N_n^u \sim n^{1/(2s)}$ since $\omega_j = j^{2s}$, $j \geq 1$.

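Proposition 3.6 Suppose that the i.i.d. $(Y, Z, W)$-sample of size $n$ obeys the model (1.1a-1.1b)
and that $U \in \mathcal{U}_\sigma$, $\sigma > 0$. Consider the estimator $\hat\varphi_{\hat k}^{(s)}$ given in (2.6) with $\hat k$ defined
by (3.3).

[fs] In the finitely smoothing case, we obtain

$$\sup_{U \in \mathcal{U}_\sigma} \sup_{\varphi \in W_p^\rho} \{ E\|\hat\varphi_{\hat k}^{(s)} - \varphi^{(s)}\|^2 \} = O(n^{-2(p-s)/(2p+2a+1)}).$$

[is] In the infinitely smoothing case, we have

$$\sup_{U \in \mathcal{U}_\sigma} \sup_{\varphi \in W_p^\rho} \{ E\|\hat\varphi_{\hat k}^{(s)} - \varphi^{(s)}\|^2 \} = O((\log n)^{-(p-s)/a}).$$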

4. Concluding remarks and perspectives 

We have proposed in this work a new kind of estimation procedure for the structural function 
and its derivatives in nonparametric instrumental regression and proved that they can attain 
optimal rates of convergence. These estimators require an optimal choice of a dimension 
parameter depending on certain characteristics of the unknown structural function and 
the conditional expectation operator of Z given W, which are not known in practice. By 
using a penalized minimum contrast estimator with randomized penalty and collection of 
models we have constructed a fully adaptive choice of this dimension parameter, which 
can attain minimax-optimal rates if the conditional expectation operator of Z given W 
is finitely or infinitely smoothing. However, when the conditional expectation operator
is not smoothing, it remains an open question whether this data-driven rule leads to a
minimax-optimal estimation procedure. We are currently exploring this issue.

A. Proofs 

A.1. Proofs of section 2

Proof of the lower bound.

Proof of Theorem 2.1. Consider $(Z,W)$ with associated conditional expectation operator
$T \in \mathcal{T}_d^\lambda$. Given $\zeta := \kappa \min(\rho, 1/(2d))$ and $\alpha_n := R_n^* (\sum_{j=1}^{k_n^*} \omega_j/(\lambda_j n))^{-1}$ we consider the
function $\varphi := (\zeta \alpha_n / n)^{1/2} \sum_{j=1}^{k_n^*} \lambda_j^{-1/2} e_j$ belonging to $\mathcal{F}_\gamma^\rho$, which can be realized as follows.
Since $(\gamma/\omega)$ is monotonically increasing it follows $\|\varphi\|_\gamma^2 \leq \zeta (\gamma_{k_n^*}/\omega_{k_n^*}) R_n^* \leq \rho$ by using
successively the definition of $\alpha_n$ and $\kappa$. Obviously for any $\theta := (\theta_j) \in \{-1,1\}^{k_n^*}$, the
function $\varphi_\theta := \sum_{j=1}^{k_n^*} \theta_j [\varphi]_j e_j$ belongs to $\mathcal{F}_\gamma^\rho$ too, and hence it is a possible candidate of the
structural function. Let $V$ be a Gaussian random variable with mean zero and variance one
($V \sim \mathcal{N}(0, 1)$) which is independent of $(Z, W)$. Let $U_\theta := [T\varphi_\theta](W) - \varphi_\theta(Z) + V$, then $U_\theta$
belongs to $\mathcal{U}_\sigma$ for all $\sigma^4 \geq 8(3 + 2\rho^2\Gamma^2\eta)$. Obviously we have $E[U_\theta \mid W] = 0$. Moreover, by
employing twice the Cauchy-Schwarz inequality the condition $\Gamma = \sum_{j} \gamma_j^{-1} < \infty$ together
with $\sup_j E[e_j^4(Z) \mid W] \leq \eta$ implies $E[f^4(Z) \mid W] \leq \rho^2 \Gamma \sum_{j \in \mathbb{N}} \gamma_j^{-1} E[e_j^4(Z) \mid W] \leq \rho^2 \Gamma^2 \eta$ for
all $f \in \mathcal{F}_\gamma^\rho$. From this estimate we conclude $E[\varphi_\theta^4(Z) \mid W] \leq \eta\rho^2\Gamma^2$ and $|[T\varphi_\theta](W)|^4 \leq
E[\varphi_\theta^4(Z) \mid W] \leq \eta\rho^2\Gamma^2$. By combination of the last two bounds we obtain $E[U_\theta^4 \mid W] \leq
8\{2\eta\rho^2\Gamma^2 + 3\}$. Consequently, for each $\theta$ i.i.d. copies $(Y_i, Z_i, W_i)$, $1 \leq i \leq n$, of $(Y, Z, W)$
with $Y := \varphi_\theta(Z) + U_\theta$ form an $n$-sample of the model (1.1a-1.1b) and we denote their
joint distribution by $P_\theta$. In case of $P_\theta$ the conditional distribution of $Y_i$ given $W_i$ is then
Gaussian with mean $[T\varphi_\theta](W_i)$ and variance 1. Furthermore, for $j = 1, \dots, k_n^*$ and each $\theta$
we introduce $\theta^{(j)}$ by $\theta^{(j)}_l = \theta_l$ for $j \neq l$ and $\theta^{(j)}_j = -\theta_j$. Then, it is easily seen that the
log-likelihood of $P_\theta$ with respect to $P_{\theta^{(j)}}$ is given by



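$$\log\Big( \frac{dP_\theta}{dP_{\theta^{(j)}}} \Big) = \sum_{i=1}^{n} 2\big(Y_i - [T\varphi_\theta](W_i)\big)\, \theta_j [\varphi]_j [Te_j](W_i) + 2[\varphi]_j^2 \sum_{i=1}^{n} |[Te_j](W_i)|^2.$$

Its expectation with respect to $P_\theta$ satisfies $E_{P_\theta}[\log(dP_\theta/dP_{\theta^{(j)}})] = 2n[\varphi]_j^2 \|Te_j\|_W^2 \leq 2nd[\varphi]_j^2 \lambda_j$
by using $T \in \mathcal{T}_d^\lambda$. In terms of Kullback-Leibler divergence this means $KL(P_\theta, P_{\theta^{(j)}}) \leq
2dn[\varphi]_j^2 \lambda_j$. Since the Hellinger distance $H(P_\theta, P_{\theta^{(j)}})$ satisfies $H^2(P_\theta, P_{\theta^{(j)}}) \leq KL(P_\theta, P_{\theta^{(j)}})$
it follows by employing successively the definition of $\varphi$, the property $\alpha_n \leq \kappa^{-1}$ and the
definition of $\zeta$ that

$$H^2(P_\theta, P_{\theta^{(j)}}) \leq 2dn[\varphi]_j^2 \lambda_j \leq 2d\zeta\alpha_n \leq 1. \qquad (A.1)$$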



Consider the Hellinger affinity p(Pg, Pgu)) = f \/dPgdPg(j) then we obtain for any estimator 
<p of (p that 



¥>0«)Jjl ' J live - ^ew)Jil 

[v-toW 2 AT3 \ 1/2 , f f P-Mf , D y/2 



< / r W tf)) + / I, i . 2 dPg 

Rewriting the last estimate by using the identity p(Pg, PgU)) = 1 — \H 2 {Pg, P s u)) an d (A.l) 
we obtain 

1,, , ,o 1. 

2 1 

Combining the last lower bound and the following reduction scheme is the key argument of 
this proof: 



{E fl |[y> - ipg]j\ 2 + E eU) \[lp - <p e u))j\ 2 } ^ ^\[ip e ~ <PgU)]j\ 2 = ^M]- 



sup sup Epjyj - (f\\l ^ sup E, Pe \\(p - (pg\\l 



0e{-l,i}*£ i =1 

= o~K E E f {EpJ[^- + Ep, w |[£- ^cflfcl 2 } 

ee{-i,i} fc n i =1 



A** A** 

E EfM? = ^E!: 

ee{-i,i} fc " i =1 i =1 



2 K 4 ^Jj 4 n \ . 

fle{-i,i} fc " i =1 j^ 1 

Hence, from the definition of $\zeta$ and $\alpha_n$ we obtain the lower bound given in the theorem. □

Proof of the upper bounds. We begin by defining and recalling notations to be used in
the proofs of this section. Given $k > 0$, denote $\varphi_k := \sum_{j=1}^{k} [\varphi_k]_j e_j$ with $[\varphi_k]_k = [T]_k^{-1} [g]_k$
which is well-defined since $[T]_k$ is non singular. Then, the identities $[T(\varphi - \varphi_k)]_k = 0$ and
$[\varphi_k - E_k\varphi]_k = [T]_k^{-1} [T E_k^\perp \varphi]_k$ hold true. Furthermore, let $[\Xi]_k := [\hat T]_k - [T]_k$ and define
vector $[B]_k$ and $[S]_k$ by

1 n 1 n 

/I . 71 . 

J = l 1=1 

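where $[\hat g]_k - [\hat T]_k [\varphi_k]_k = [B]_k + [S]_k$. Note that $E[B]_k = 0$ due to the mean independence,
i.e., $E(U \mid W) = 0$, and that $E[S]_k = [T\varphi]_k - [T\varphi_k]_k = 0$. Moreover, let us introduce the
events

$$\Omega := \{ \|[\hat T]_k^{-1}\| \leq \sqrt{n} \}, \quad \Omega_{1/2} := \{ \|[\Xi]_k\| \|[T]_k^{-1}\| \leq 1/2 \},$$

$$\Omega^c := \{ \|[\hat T]_k^{-1}\| > \sqrt{n} \} \quad \text{and} \quad \Omega_{1/2}^c := \{ \|[\Xi]_k\| \|[T]_k^{-1}\| > 1/2 \}.$$

Observe that $\Omega_{1/2} \subset \Omega$ in case $\sqrt{n} \geq 2\|[T]_k^{-1}\|$. Indeed, if $\|[\Xi]_k\| \|[T]_k^{-1}\| \leq 1/2$ then the
identity $[\hat T]_k = [T]_k \{ I + [T]_k^{-1} [\Xi]_k \}$ implies $\|[\hat T]_k^{-1}\| \leq 2\|[T]_k^{-1}\|$ by the usual Neumann
series argument. Moreover, in case $T$ satisfies the extended link condition (2.7), that is
$T \in \mathcal{T}_{d,D}^\lambda$, then $2\|[T]_k^{-1}\| \leq 2\|[\mathrm{diag}(\lambda)]_k^{-1/2}\| \|[\mathrm{diag}(\lambda)]_k^{1/2} [T]_k^{-1}\| \leq 2\sqrt{D/\lambda_k}$ since $\lambda$ is non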
increasing. Finally, given k*, R* and k defined in Theorem 2.1 we have K~ l u k *^7} ^ 

R* n ^ Ylj=i w j( n ^j) 1 hy using successively the definition of k and R* n . By combination 
of the last estimate and the condition sup fcgN /c 3 7^ 1 ^ £ it follows that (A:*) 3 (nAfc* )" 1 ^ 
K- l (ktf-il} < k -1 C- Thus, for all n G N with (/ci) 3 > 4L>C« _1 we have 411 [T]v} II 2 < 

n n 

ADX^} ^ n4.D£re — :L (A;^) _3 ^ n, and hence fij/2 C f&. These notations and results will be 
used below without further reference. 

We shall prove in the end of this section three technical lemmas (A.l - A. 3) which are used 
in the following proofs. 

Proof of Theorem 2-4- Define (p k * := y>fc*lfi and decompose the risk into two terms, 

n\$k* - <p\\l < 2{E||£ fc * - \\l + El^fc. - } =: 2{A 1 + A 2 }, (A.2) 

which we bound separately. Consider first A 2 . By combination of $7 C C Q\i 2 an d the identity 
W&K-VWI = ll^-^llSln + lbll 2 ,!^ we deduce E,\\& k *-ip\\l ||<^ fc * -ip\\l+\\<p\\lP(n c 1/2 ). 
Since (w/7) is monotonically decreasing, the last estimate together with (A. 12) in Lemma 
A.2 implies for all 92 G 

E ll^fc* - < 4DdpR* n max (l, ^ max ^ ) +pP(S2? /2 ) (A.3) 

by employing the definition of R* n . 

Consider A\. From the identity [g] k , - [T] k * [p m )ki = [ b \k + \ s \k follows 
[<PK ~ = + PlgCPk " ^K^N^k + [%}ln 
By making use of this identity we decompose $A_1$ further into two terms
n\$K - $ K \\l < 2E[|| [di&g(uj)]ll 2 [T]~} {[B] K + [S\ K }fl n ] 



,l/2 m _l- - 



+ 2E[||[diag( W )]^[T]^[S] fc;; [T] fc , {[B] K + [S] K }\\ 2 t n ] =: 2{A 11+ A 12 } (A.4) 



which we bound separately. In case of A\\ we employ successively (A. 11) in Lemma 
A.l with M := [diagfw)]^ 2 ^]" 1 , the elementary inequality ti(A t B t BA) \\A\\ 2 tx(B t B) 
valid for all (k x k) matrices A and -B and the extended link condition (2.7), that is, 
HtdiagfA)]^ 2 ^]- 1 !! 2 < D. Thereby, we obtain 

Efllldiaglo;)]^^]^ 1 [5]^}|| 2 ld 

< (2/n) D tr ([diag(A)]^ 1/2 [diag( W )]^[diag(A)]^ /2 ) {a 2 + V 2 F \\<p - <p K || 2 } 

h* 
K n 

= 2D{a 2 + V 2 r\\ i p- (pk *\\ 2 }Y J ^-- (A.5) 



n\j 



Consider now A12. Observe that || [diag(o;)]^ 2 [T] fc , 1 || 2 ^ Z) max^^.* ay/Aj for all T £ 



Tl '"Ti 



7^,. By employing the last inequality together with ||[T] fe » || 2 1q 1/2 ^ AD/Xk* and ||[T] fe » \\ 2 tn ^ 
n there exists a numerical constant C > such that 

E[|| [diagH]| 2 [Tg [H]^[f]^ { [B] kA + [5]^} || 2 l n ] 

^ D i™k ^-{^^-^^"^llt^*^!! 3 *!! [-»l*s.-^-[-srifcs.ll = * =t «^=»-^"'aE5|l I^ j i^ll s ll[-»]ife^.-^-[-sr]jfe^ll =a ^-«^ =a } 

< ^{4^A^ (1E|| [S]^||^) 1/2 +r,(lE|| [S]^|| 8 ) ^^(Q^) V^} [ J B]^+[ J S]^||4 ) V2 



< C max ^- Di, 4 (a 2 + T ^ - tp K || 2 )(4 J D^ + (A:;) 3 |P(^ 



where the last bound follows from (A. 8), (A. 9) and (A. 10) in Lemma A. 10. By combination 
of the last bound and (A.5) via the decomposition (A.4) there exists a numerical constant 
C > such that 



3=1 3 



Furthermore, taking into account the estimate (A. 12) from Lemma A. 2 with w = 7 and the 
definition of i?* , the last inequality implies 

E||£ fc » -£ fc *|| 2 < CDrf (a 2 + ArDdp){ADC/K + (k* n ) 3 \P(Q' i/2 )\ 1 / A }R* n . 

Finally, since $\Omega_{1/2}^c \subset \{ \|[\hat T]_{k_n^*} - [T]_{k_n^*}\|^2 > \lambda_{k_n^*}/(4D) \}$, by using the decomposition (A.2) the
result of the theorem follows from the last estimate and (A.3). □

Proof of Theorem 2.6. We start our proof with the observation that under Assumption A3

<2eXp{_ 20^ft* : " + 21ogi: " ) 

by applying successively (A. 14) in Lemma A. 3 and the estimate (&* ) 3 (nAfc* ) _1 ^ K ~ l {KiTlk* 
k _1 C- From this estimate we conclude for all n G N with (log £;*)/£;* ^ k/(280Dt] 2 Q and 
(log > -k/(40ZVO that 

(K) 12 p(\m kA -[Th A \\ 2 >^)^2, 
(Kr 1 p(\\m kA -[Th A \\ 2 > ) ^)^2. 

By employing these estimates the assertion follows now from Theorem 2.4. □ 



Illustration: estimation of derivatives.

Proof of Proposition 2.8. Since for each $0 \leq s \leq p$ we have $E\|\tilde f^{(s)} - f^{(s)}\|^2 \sim E\|\tilde f - f\|_\omega^2$ we
intend to apply the general result given in Theorem 2.1. In both cases the additional conditions
formulated in Theorem 2.1 are easily verified. Therefore, it is sufficient to evaluate the
lower bound $R_n^*$ given in (2.4). Note that the optimal dimension parameter $k_n^*$ satisfies
$R_n^* \sim \omega_{k_n^*}/\gamma_{k_n^*} \sim \sum_{j=1}^{k_n^*} \omega_j/(n\lambda_j)$ since both sequences $(\gamma_j/\omega_j)$ and $(\sum_{0 < |l| \leq j} \omega_l/\lambda_l)$ are
non-decreasing.

[p-λ] The well-known approximation $\sum_{j=1}^{m} j^r \sim m^{r+1}$ for $r > 0$ implies
$n \sim (\gamma_{k_n^*}/\omega_{k_n^*}) \sum_{j=1}^{k_n^*} \omega_j/\lambda_j \sim (k_n^*)^{2a+2p+1}$. It follows that $k_n^* \sim n^{1/(2p+2a+1)}$ and the lower
bound writes $R_n^* \sim n^{-(2p-2s)/(2p+2a+1)}$.

[e-λ] Applying Laplace's Method (c.f. chapter 3.7 in Olver (1974)) we have
$n \sim (\gamma_{k_n^*}/\omega_{k_n^*}) \sum_{j=1}^{k_n^*} \omega_j/\lambda_j \sim (k_n^*)^{2p} \exp(|k_n^*|^{2a})$ which implies that
$k_n^* \sim \{\log(n/(\log n)^{p/a})\}^{1/(2a)} = (\log n)^{1/(2a)}(1 + o(1))$ and that the lower bound can be
rewritten as $R_n^* \sim (\log n)^{-(p-s)/a}$. □

Proof of Proposition 2.9. Since in both cases the dimension parameter is chosen optimally
(see the proof of Proposition 2.8) the result follows from Theorem 2.4. □
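The balance between the two terms used in the proof of Proposition 2.8 can also be checked numerically. The sketch below assumes that the optimal dimension parameter in (2.4) is the minimizer over $k$ of $\max(\omega_k/\gamma_k, \sum_{j \leq k} \omega_j/(n\lambda_j))$; since (2.4) is not restated in this section, this form is an assumption of the example, and the helper name `k_star` is hypothetical.

```python
def k_star(n, omega, gamma, lam, k_max=10000):
    """Minimize max(omega_k / gamma_k, sum_{j<=k} omega_j / (n * lambda_j)) over k.

    omega, gamma and lam are callables on j >= 1; returns (k*, value at k*).
    """
    best_k, best_val = 1, float("inf")
    variance = 0.0
    for k in range(1, k_max + 1):
        variance += omega(k) / (n * lam(k))   # partial sum, non-decreasing in k
        value = max(omega(k) / gamma(k), variance)
        if value < best_val:
            best_k, best_val = k, value
        if variance > best_val:
            break                             # larger k cannot improve any more
    return best_k, best_val


# Polynomial case with p = 2, a = 1, s = 0: gamma_j = j^4, lambda_j = j^(-2), omega_j = 1.
k_opt, r_opt = k_star(
    10 ** 6,
    omega=lambda j: 1.0,
    gamma=lambda j: float(j) ** 4,
    lam=lambda j: float(j) ** (-2),
)
```

In this example `k_opt` is of the same order as $n^{1/(2p+2a+1)} = n^{1/7}$ and `r_opt` of the same order as $n^{-2p/(2p+2a+1)} = n^{-4/7}$, up to constants.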



Technical assertions. The following paragraph gathers technical results used in the proofs
of this section.

Lemma A.1 Suppose that $U \in \mathcal{U}_\sigma$, $\sigma > 0$ and that the joint distribution of $(Z, W)$ satisfies
Assumption A2. If in addition $\varphi \in \mathcal{F}_\gamma^\rho$ with $\Gamma = \sum_{j=1}^{\infty} \gamma_j^{-1} < \infty$, then there exists a constant
$C > 0$ such that for all $k \in \mathbb{N}$ and for all $z \in \mathbb{R}^k$

$$E|z^t [B]_k|^2 \leq (1/n)\, \|z\|^2 \sigma^2, \qquad (A.6)$$
$$E|z^t [S]_k|^2 \leq (1/n)\, \|z\|^2 \eta^2 \Gamma\, \|\varphi - \varphi_k\|_\gamma^2, \qquad (A.7)$$
$$E\|[B]_k\|^4 \leq C \cdot ((k/n) \cdot \sigma^2 \cdot \eta^2)^2, \qquad (A.8)$$
$$E\|[S]_k\|^4 \leq C \cdot ((k/n) \cdot \eta^2 \cdot \Gamma \cdot \|\varphi - \varphi_k\|_\gamma^2)^2, \qquad (A.9)$$
$$E\|[\Xi]_k\|^8 \leq C \cdot ((k^2/n) \cdot \eta^2)^4. \qquad (A.10)$$

Moreover, given a $(k \times k)$ matrix $M$, we have

$$E\|M\{[B]_k + [S]_k\}\|^2 \leq (2/n)\, \mathrm{tr}(M^t M) \{ \sigma^2 + \eta^2 \Gamma \|\varphi - \varphi_k\|_\gamma^2 \}. \qquad (A.11)$$

Proof. The proof of (A.6) - (A.10) can be found in Johannes and Breunig (2009) and
we omit the details. The estimate (A.11) follows by employing (A.6) and (A.7) from the
identity $\|M\{[B]_k + [S]_k\}\|^2 = \sum_{j=1}^{k} |M_j^t \{[B]_k + [S]_k\}|^2$, where $M_j$ denotes the $j$-th column
of $M^t$, which completes the proof. □

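Lemma A.2 Let $g = T\varphi$ and for each $k \in \mathbb{N}$ denote $\varphi_k := [T]_k^{-1} [g]_k$. Given sequences $\lambda$ and
$\gamma$ satisfying Assumption A1 let $T \in \mathcal{T}_{d,D}^\lambda$ and $\varphi \in \mathcal{F}_\gamma^\rho$. For each strictly positive sequence
$w := (w_j)_{j \in \mathbb{N}}$ such that $w/\gamma$ is non increasing we obtain for all $k \in \mathbb{N}$

$$\|\varphi - \varphi_k\|_w^2 \leq 4Dd\rho\, \frac{w_k}{\gamma_k} \max\Big( 1, \frac{\lambda_k}{w_k} \max_{1 \leq j \leq k} \frac{w_j}{\lambda_j} \Big). \qquad (A.12)$$

Proof. The condition $T \in \mathcal{T}_{d,D}^\lambda$, that is, $\sup_{k \in \mathbb{N}} \|[\mathrm{diag}(\lambda)]_k^{1/2} [T]_k^{-1}\|^2 \leq D$ and $\|Tf\|^2 \leq d\|f\|_\lambda^2$
for all $f \in L^2_Z$, together with the identity $[E_k\varphi - \varphi_k]_k = -[T]_k^{-1} [T E_k^\perp \varphi]_k$ implies
$\|E_k\varphi - \varphi_k\|_\lambda^2 \leq D\|T E_k^\perp \varphi\|^2 \leq Dd\|E_k^\perp \varphi\|_\lambda^2 \leq Dd\gamma_k^{-1}\lambda_k\rho$ for all $\varphi \in \mathcal{F}_\gamma^\rho$ because $(\lambda/\gamma)$ is
monotonically non increasing. From this estimate we conclude

$$\|E_k\varphi - \varphi_k\|_w^2 = \|[\mathrm{diag}(w)]_k^{1/2} [E_k\varphi - \varphi_k]_k\|^2 \leq \|[\mathrm{diag}(w)]_k^{1/2} [\mathrm{diag}(\lambda)]_k^{-1/2}\|^2\, \|E_k\varphi - \varphi_k\|_\lambda^2 \leq Dd\rho\, \frac{\lambda_k}{\gamma_k} \max_{1 \leq j \leq k} \frac{w_j}{\lambda_j}. \qquad (A.13)$$

Furthermore, since $(w/\gamma)$ is non increasing, we have $\|E_k^\perp \varphi\|_w^2 \leq \rho w_k/\gamma_k$ for all $\varphi \in \mathcal{F}_\gamma^\rho$.
The assertion follows now by combination of the last estimate and (A.13) via a
decomposition based on an elementary triangular inequality. □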

Lemma A. 3 Suppose that the joint distribution of (Z, W) satisfies Assumption A3. If in 
addition the sequence A fulfills Assumption Al, then for all k G N we have 

Pimkf > < 2e II p { - F ^ + 21 „ g * } . (A.14) 

Proof. The proof of the assertion can be found in Johannes and Breunig (2009) and we 
omit the details. □ 
A.2. Proofs of section 3

We begin by defining and recalling notations to be used in the proof. Given $u \in L^2[0, 1]$ we
denote by $[u]$ the infinite vector of Fourier coefficients $[u]_j := \langle u, \psi_j \rangle$. In particular we use
the notations



j.i 



Furthermore, let 5 be the function with Fourier coefficients \g]j := [g]j and observe that 
Eg = g. Given 1 ^ k ^ k' we have then for all i G «S& := spanj-i/'i, • • . , ipk} 

<t,fo/) w = -rr^SlI inf [T] 2 > l/n} = <i,£ fc ) w , 

i=l j=l I 1 Ijj 

■y TL k r ,-j 

n i=i j=i [J w 

k k 



i=i L J 



33 j=1 

Consider the contrast T(t) := - 2(t, &§) , for all t G L 2 [0, 1]. Obviously it follows for 
all t E Sf. that T(t) = ||t — fikWu ~ \\<Pk\\t an d, hence 

argminT(t) = (p k , Vfc > 1. (A.16) 
Then, the adaptive choice of the dimension parameter can be rewritten as 

k = argmin{T(£ fc ) + p£h(fc)} with peh(£;) := 540E[y 2 ] — . (A.17) 

Then for all 1 ^ k ^ N n , we have that T(^r) + pen(/c) ^ T(^) +pen(/c) ^ Y(c/?fc) + pen(fc), 
using first (A.17) and then (A.16). This inequality implies 

W&kWl - \\<Pk\\l «S - + Peh(^) - Pen(^), 

which together with the identities given in (A. 15) for all 1 ^ k ^ N n implies 

< II V — ¥>Jfc||S + Pem>) -pen(fc) + 2(^-^ fc ,8 ? -$ s ) a) (A.18) 

Consider the unit ball := {/ G : \\f\\u ^ 1} and, for arbitrary r > and i G <Sfc, the 
elementary inequality 

1 1 ^ 

2|(i, ^ 211*^ sup |(t, /i) w | < T\\t\\l + - sup |(t, h) w \ 2 = r\\t\\l + - y\u)j\[h]j\ 2 . 
t&B k r teBk t ^ 
Combining the last estimate with (A.18) and $\hat\varphi_{\hat k} - \varphi_k \in \mathcal{S}_{\hat k \vee k}$ we obtain



1 



W%-<P\M^ \\'P- ( Pk\\t + r\\lp % -(ph\\i+ven(k)-pen{k) + - sup |<t,%-$ 



Letting r := 1/3 it follows from \\ip^ — (fk\\t K 2 
5 



+ 2||v3 fc -ip\\l that 



< il|p-Wfe|lw+pen(*0-pen(*0 + 3 SU P \{t,$g-$g)u\ 2 - 

6 t6B t „t 



Consider the functions z? and /I with Fourier coefficients = - Y2?=i ^ Q; n}V'j(^ / i) 

and [j2]j = - 5Z£=i > Qfol^fWi) respectively, and their centered versions 

Ez/ and fj, = Jl — then we have g — g = u + fj, and 



v = v 



1 



11% 



3 „r fe r IIU ^3 

i su 



<^fc||w + pen(A;) - pen(£;) 



+ 6 sup \{t,<S> u ) u \ 2 + 12 sup \{t,$ v -$ v ) u \ 2 + 12 sup |(i,$ M + $ s -$ s ) w | 2 

2-nrn i i \i+ ^ ^F. \ i2nrr>c 



Decompose | (t, — § h 



\(t, $„ - * I/ ) w | J l{fi,} + |(t, - $,U 2 1{5^} further using 
-l 



(A.19) 



Since > l/m}l{O g } = l{Q q }, it follows that for all 1 < j < iV n we have 



[T] 



"1{[T]'. ^ 1/n} - 1 ) t{Q q } = \{T] j3 \ 2 l{U q } 



1 

< . 
4 



Hence, sup fgBft - $„) w | 2 l{O g } \ sup teBfc $^) w | 2 for all 1 ^ k < N n and 



+ 12 sup |(t,$,-^) w | 2 l{^} + 12 sup |(i,$ M + $ 9 -^) w | 2 . (A.20) 



Define Af := max^^i [T]jj| 2 , := maxi< S j< sfc (a;j)vi/| [T] 
and 5l := kAj {log(rJ V (k + 2)) / log(/c + 2)}. Then, it is easily seen that 

log 3 



(A.21) 



with ^ = (log 3d) /(log 3). Moreover, define the event f2 gp := f2q n f2 p where 9 is given in 
(A. 18) and 



SI 
Observe that on $\Omega_q$ we have $(1/2)\Delta_k^T \leq \hat\Delta_k \leq (3/2)\Delta_k^T$ for all $1 \leq k \leq N_n$ and hence
$(1/2)[\Delta_k^T \vee (k+2)] \leq [\hat\Delta_k \vee (k+2)] \leq (3/2)[\Delta_k^T \vee (k+2)]$, which implies

n /2^A r f log[A * v (fc + 2)] Vi log2 log(fc + 2) 

WJ H log(A; + 2) A log(& + 2) log(Af V [Jfe + 2]) 

< 4 < (3/2)^( iog(A ^ v[fc : 2]) ) fi + log3/2 log( " + 2) 



log(fe + 2) A log(* + 2) log(A^ V [k + 2}) J " 

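Using $\log(\Delta_k^T \vee (k+2))/\log(k+2) \geq 1$, we conclude from the last estimate that

$$\delta_k^T/10 \leq \frac{\log(3/2)}{2\log 3}\, \delta_k^T \leq (1/2)\, \delta_k^T [1 - (\log 2)/\log(k+2)] \leq \hat\delta_k \leq (3/2)\, \delta_k^T [1 + (\log 3/2)/\log(k+2)] \leq 3\delta_k^T.$$

Recall that $\widehat{\mathrm{pen}}(k) = 540\, E[Y^2]\, \hat\delta_k n^{-1}$; we define

$$\mathrm{pen}(k) := 54\, E[Y^2]\, \delta_k^T n^{-1},$$

then it follows that on $\Omega_q$ we have

$$\mathrm{pen}(k) \leq \widehat{\mathrm{pen}}(k) \leq 30\, \mathrm{pen}(k) \quad \forall\, 1 \leq k \leq N_n.$$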
On = fi g n fip, we have ^ iV n . Thus, 



pen(fc V k) + pen(fc) — pen(A;)J ^ (^pen(fc) + pen(fc) + peh(fc) — pen(fc)J l{Sl gp } 

^31pen(fc) \fl^k^N n . (A.22) 

Furthermore, we obviously have Afc n AjT for every 1 k ^ N n , which implies S k ^ n(l + 
logn)S^. Consequently, pen(&) ^ 540E[Y 2 ] n (1 + logn), because S^/n ^ dQSk/n ^ ^Crf 
for all 1 ^ k ^ N n by (A. 21) and the definition of N n . On n Cl p , we have k ^ N n and 
hence pen(fc V k) ^ pen(iV n ) 54E[Y 2 ], which implies 

(pen(A;Vfc) + pen(A;)-pen(fc))l{0^nOp} < 594 E[Y 2 ] n (1 + logn)l{Q£ n %}. (A.23) 

We note further that for all </? G J 7 ^ with ^ZjeN'Yj" = T < oo and for all z G [0, 1] we have 
|^(z)| 2 P^2,j^nlJ l '4' 2 j{ z ) ^ 2pr by employing Cauchy-Schwarz inequality. Thereby, given 
m ^ 1 such that E,U 2m \W < o- 2m , it follows 

E[Y 2m |W] < 2 2m (2pr + a 2 )" 1 and, hence E[Y 2m ] < 2 2m (2/T + a 2 ) m . (A.24) 

At the end of this section we will prove three technical Lemmata (A. 5, A. 8 and A. 7) which 
are used in the following proof. 

Proof of Theorem 3.4- The proof is based on the decomposition 

n\n - <p\\i = n\h - vwmtow} + n\n - villus n ^} + n\n - <p\\mn&- 
Below we show that there exists a numerical constant $C$ such that for all $n \geq 1$ and all
we have 



E 



\(p^ - (^||^l{figp} ^ CU\(p - cpk\\l + pen(k) + d/jmaxj 

(" 



UJ: 



mm 



(2 P r + a"-f p P r + <r 2 + ljrfo „ ( (>r + <r 2 )0 + v„ lz 

H 1 2j 



T/2 



(A.25) 



El 



% - ^llS 1 !^ n Q P } < C<^ || <^ - <p fc || 2 + dpmax <^ min (l, 

( (2 P r + a 2 )Cd + Vu\ Z 



+ 



(2pT + a 2 + ifd 



n 



C 



OS 



n\j' 



+ 1 



E||^-^||^l{^}^-(2pr + a 2 ) 



(A.26) 



(A.27) 



The desired upper bound follows by using (A. 21), that is, pen(fc) ^ 54 (2pT + a 2 ) dCd^k n , 
and by employing the monotonicity of w/7, that is \\<p — </?£,• || 2 ^ p^h I Ik- 
Proof of (A.25). By employing the estimate (A. 28) and pen(/c) := 54E[Y 2 ] S^n^ 1 we have 



\<h - <p\L < ollv 3 - VfcL + 9 su p 



E[F 2 1 <5 T 



fcvfc 



tee. 



+ pen(/c V A;) + pen(A;) — peh(A;) 
+ 12 sup \{t,$ v -$ v ) u \ 2 l{n c Q } + l2 sup |(t,$ M + $ 9 -$ s v i:> 



teB, 



teB,. 



-g/u>\ 



and, hence using that k ^ N n on Q p we obtain for all 1 ^ A; < N l n 

1 5 / Efy 2 i5 T \ 

3 \\th - <f\\iH^ q p} < 3 llv - vfcllS + 9 1] ( g up l(t, 2 - 6 J fc J 

k=l ^ teB k 11 / 



5 .. Il2 

< gllV - P*L 



+ 12 sup |(t,* M + * fl -* p )a;| + (pen(/eVfc)+pen(fc)-pen(fc))l{O g p} 



+ 9gfsup|(t,^| 2 -6^1^') 



+ 12 sup |(t,^ + $ 9 -^) w | 2 + 31pen(/c), 

where the last inequality follows from (A. 22). The second term is bounded by employing 
Lemma A. 5. In order to control the third term, apply Lemmata A. 6 and A. 7. Consequently, 
combining these estimates proves inequality (A.25). 



Proof of (A.26). On Q c q n ftp, we have N l n < N n ^ JV n . Applying (A. 28), it follows from 
(A.23) that for all $1 \leq k \leq N_n^l$

-||^ - ^ n ^ - <^|| 2 + 9^ sup \(t, ^U 2 - 6^^A 
13 6 k =l ^* eB fe 

+ 594E[y 2 ]n(! + logn)l{fi£ n J2 P } 

+ 12 sup |(t,$ i ,-^) w | 2 l{^} + 12 sup |(t,$ M + $ 9 -$ 9 ) w | 2 . 

Due to Lemmas A. 5, A. 7, A. 8, and A. 9, there exists a numerical constant C such that 

E||% - </>|| 2 1{^ n n p } < c| ||p - <^|| 2 + dpmax min (l, Jj-] 



+ (2 P r + a 2 )n(l + logn)P[^] + dP[^] 1/2 + ^ ^° 
+ (2 P r + a 2 + l)dCd s /(2pr + a 2 )C rf + 



n 



n 



\ V 2 

\ u\z 



Employing Lemma A. 9 now proves (A. 26). 

Proof of (A. 27). Let (pj. := Ylj=i[ i P]j^-{[ r ^\jj ^ l/ n }' l Pj- ^ is easy to see that \\tpk — tpk\\ 2 ^ 

\\<PK -$k>\? for a11 fe' < k and ll^fc-^ll 2 «S IMI 2 for all fc ^ 1. Thus, using that 1 ^ k ^ N%, 
we can write 

e||^ - ^|| 2 t{n;} < 2{E||^ - ^|| 2 1{^} + e||^ - <^|| 2 1{^}} 
< 2|e||^« - N u\\mn c p } + p[n c p ] 

Moreover, since $\sup_j E[Y^4\psi_j^4(W)] \le 64(2\rho\Gamma + \sigma^2)^2$ and $E[\psi_j^4(W)\psi_j^4(Z)] \le 16$ due to (A.24),
it follows from Theorem 2.10 in Petrov (1995) that



m\vN X - N}t \\mn c p } 

N u 

< 2n5> J -{E(b] i - [T]^^) 2 !!^} + E([T] i ,M, - [ry<^) 2 l{^}} 

i=i 

< 2r»{ [E (Uj ~ Mi)*] V2 P[^] 1/2 

i=i 

+ 5]a; i |Hf[E([T] ii - [T],,) 4 ] 1/2 P[^] 1/2 } 
i=i 

< Cn{n (2pT + a 2 ) + (fT^MI* )} P[^] 1/2 , 



where we used that Wj ^ n(maxi^j^N% u)j) ^ n2 due to Definition 3.2 (ii). Since 

$(\omega/\gamma)$ is non-increasing, (A.27) follows from Lemma A.10, which completes the proof. □
Illustration: estimation of derivatives

Proof of Proposition 3.6. In the light of the proof of Proposition 2.8 we apply Theorem 3.4,
where in both cases the additional conditions are easily verified (cf. Remark 3.3) and the
result follows by an evaluation of the upper bound. Note further that $(\omega/\lambda)$ is in both cases
non-decreasing, and hence the second term in the upper bound of Theorem 3.4 is always
smaller than the first one.

In case [fs] we have $N_n^l \sim (n/(\log n))^{1/(2a+2s+1)}$ and $k_n^* := n^{1/(2a+2p+1)}$. Note that $k_n^* \lesssim N_n^l$.
Thus, the upper bound is of order $O((k_n^*)^{-2(p-s)} + n^{-1}) = O(n^{-2(p-s)/(2a+2p+1)})$.
In case [is] we have $N_n^l \sim \{\log(n/(\log n)^{(2s+2a+1)/(2a)})\}^{1/(2a)} = (\log n)^{1/(2a)}(1 + o(1)) \sim k_n^*$.
Thereby, the upper bound is of order $O((k_n^*)^{-2(p-s)} + n^{-1}) = O((\log n)^{-(p-s)/a})$, which
completes the proof. □
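The two orders obtained in this proof are easy to tabulate numerically. The following Python sketch evaluates them; the function names and the sample values of p, s, a and n are assumptions chosen here for illustration and do not come from the paper.

```python
import math

def rate_finitely_smoothing(n, p, s, a):
    """Order n^{-2(p-s)/(2a+2p+1)} of the upper bound in case [fs]."""
    return n ** (-2.0 * (p - s) / (2.0 * a + 2.0 * p + 1.0))

def rate_infinitely_smoothing(n, p, s, a):
    """Order (log n)^{-(p-s)/a} of the upper bound in case [is]."""
    return math.log(n) ** (-(p - s) / a)

def k_star_finitely_smoothing(n, p, a):
    """Dimension parameter k_n^* = n^{1/(2a+2p+1)} in case [fs]."""
    return n ** (1.0 / (2.0 * a + 2.0 * p + 1.0))

if __name__ == "__main__":
    p, s, a = 2.0, 1.0, 1.0  # hypothetical smoothness / ill-posedness values
    for n in (10**3, 10**5, 10**7):
        print(n,
              rate_finitely_smoothing(n, p, s, a),
              rate_infinitely_smoothing(n, p, s, a))
```

For fixed (p, s, a) the polynomial rate of case [fs] decays much faster in n than the logarithmic rate of case [is], which reflects the stronger ill-posedness in the infinitely smoothing case.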



Technical assertions. In the proof of Lemma A.5 below we will need the following lemma,
which can be found in Comte et al. (2006).

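Lemma A.4 (Talagrand's Inequality) Let $T_1,\dots,T_n$ be independent random variables and
$\nu_n(r) = (1/n)\sum_{i=1}^n [r(T_i) - E[r(T_i)]]$, for $r$ belonging to a countable class $\mathcal R$ of measurable
functions. Then,

\[ E\Big[\sup_{r\in\mathcal R}|\nu_n(r)|^2 - 6H_2^2\Big]_+ \le C\Big(\frac{v}{n}\exp\big(-(nH_2^2/6v)\big) + \frac{H_1^2}{n^2}\exp\big(-K_2(nH_2/H_1)\big)\Big) \]

with numerical constants $K_2 = (\sqrt{2} - 1)/(21\sqrt{2})$ and $C$ and where

\[ \sup_{r\in\mathcal R}\|r\|_\infty \le H_1, \qquad E\Big[\sup_{r\in\mathcal R}|\nu_n(r)|\Big] \le H_2, \qquad \sup_{r\in\mathcal R}\frac{1}{n}\sum_{i=1}^n \mathrm{Var}(r(T_i)) \le v. \]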
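To get a feeling for how the two exponential terms on the right-hand side of Talagrand's inequality (Lemma A.4) behave, the following Python sketch evaluates that right-hand side divided by the unspecified numerical constant C. The function name and the sample values of n, H1, H2 and v are assumptions made here for illustration only.

```python
import math

# Numerical constant K_2 = (sqrt(2) - 1) / (21 sqrt(2)) from Lemma A.4.
K2 = (math.sqrt(2.0) - 1.0) / (21.0 * math.sqrt(2.0))

def talagrand_rhs_over_C(n, H1, H2, v):
    """Right-hand side of Lemma A.4 divided by the numerical constant C."""
    variance_term = (v / n) * math.exp(-n * H2 ** 2 / (6.0 * v))
    sup_norm_term = (H1 ** 2 / n ** 2) * math.exp(-K2 * n * H2 / H1)
    return variance_term + sup_norm_term

if __name__ == "__main__":
    for n in (100, 1000, 10000):
        # H1, H2, v are held fixed here; in the proofs they depend on n and k.
        print(n, talagrand_rhs_over_C(n, H1=1.0, H2=0.1, v=1.0))
```

In the proof of Lemma A.5 the constants depend on n and on the dimension k, so the quantity of interest there is the sum of such terms over k rather than a single evaluation.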



Lemma A.5 There exists a numerical constant $C > 0$ such that



E E sup|(t,* 



k=l 



|2 6E[Y 2 ]<f 



n 



< -\(2pT+a 2 +l)dC d ^ 
n 



V 2 
v u\z 



where $\Sigma(\cdot)$ is the function from Definition 3.1.
Proof. For t £ S k define the function rt(y, w) := Y2j=i 

UjVl{\v\ < n^M-HM^^ 1 , then 
it is readily seen that (t, § u ) w = \ Y2=l r t{Y k , W k ) - M[r t {Y k , W k )}. 

Next, we compute constants $H_1$, $H_2$, and $v$ verifying the three inequalities required in
Lemma A.4. Consider $H_1$ first:



su PFt|loo 

ieBfc y,w 



su-pY,", {VH\V\ <" 1/3 }Pj#Vi 



w 



< 2n 2 ^5l =: Hi 



Next, find $H_2$. Notice that



E 



k 

[sup \{t,$ v ) u f] = ~^>i|[ry- 2 ¥ar(yi{|y| < n 1 ^}^)) 



fc 



1 A T 
< -^^l^r 2 HH Y2 \W]^j(W) 2 ] < 2E[Y 2 ]-^ =: ff. 
As for $v$, we note that due to (A.24) for all $\varphi \in \mathcal F_\gamma^\rho$ the condition $U \in \mathcal U_\sigma$, i.e., $E[U^2|W] \le \sigma^2$,
implies $E[Y^2|W] \le 2(2\rho\Gamma + \sigma^2)$, and hence



sup Var(r 4 (Y, W)) < sup E 



sup E 

t&B k 



< 2(2pr + a 2 ) sup ^ L r ^ rJr E[-0j(T^)^(^)] 



k 

^ 2(2pr + a 2 ) max sup Vw 7 [i] 2 < 2(2pr + cr 2 )A 



T 

k =■ v > 



By employing Lemma A.4 we conclude

N, 

-e[( 



/ 

Ve sup 

* — ' L \ + C K> 



fc=l 



|2 6E[y 2 ]<f 



n 



E[y 2 ] ^(2pr + a 2 ) AT 



r? 



E 



j e[y 2 ] 



A fc exp 



E[y 2 



6(2pr + o- 2 



+ n 2 / 3 exp (-K 2 VW 2 !^ 6 ) E ^ • 

fc=l n J 



The definition of N n together with (A. 21) implies Ylk=i ^ifcY 77,2 ^ Cd- Thereby, using (A. 21), 
<C dr^ and the function £ given in Definition 3.1, there exists a numerical constant 
$C > 0$ such that



Ve[( sup|(t,*„ 



fc=l 



6E[y 2 ]<5 T 



n 



C 



2l ^ v /(2 /0 r + a 2 )C a 



n 



E[y 2 ] 



Moreover, we have E[F 2 ] sC 2(2pf + a 2 ) and inf^P E[Y 2 ] ^ inf v6£ a E(<p(Z) + C/) 2 > 



E(C/ - E[C/|Z]) 2 = E[¥ar(C/|Z)] = V^ |z , which implies the result. 
Lemma A.6 For every $n \in \mathbb{N}$ we have



□ 



E 



sup 

tEB Nn 



< 2 9 (2pr + C T 2 ) 4 n- 1 . 

Proo/. Since [/% = - E[/% and Var[/2]j < n _1 Ey 2 l{|F| > n l ^}tp 2 (W), it is easily 
seen that
E 



E[y 4 |W]E[l{|Y| > n 1/3 }\W] $(W) 



1/2 



1 Nn 
sup Kt,®^}^ 2 ^nVujVar[/i] 

^E 

3=1 

Moreover, given $m = 6$ we have $E[Y^{12}|W] \le 2^{12}(2\rho\Gamma + \sigma^2)^6$ for all $\varphi \in \mathcal F_\gamma^\rho$ and $U \in \mathcal U_\sigma$
due to (A.24) and, hence, by Markov's inequality $E[1\{|Y| > n^{1/3}\}|W] \le 2^{12}(2\rho\Gamma + \sigma^2)^6 n^{-4}$.
Combining these estimates, we obtain

jv„ 

SUp \{t,$fj,) u 
t&B Nn 



E 



^E 

3=1 

The result follows now from N n n. 



2 8 (2pr + ^) 4 n-^)(W) 



2\4 -2 



T ) N n {2pT + o- 2 ) 4 n 



□ 
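The power of two and the power of n in this proof can be traced through the two conditional moment bounds. The following worked step is only a sketch: $\kappa$ is a shorthand introduced here for the constant in parentheses in the bounds derived from (A.24), and it is assumed that the fourth-moment bound has the same form as the twelfth-moment bound, that is, $E[Y^{2m}|W] \le 2^{2m}\kappa^m$ for $m = 2$ and $m = 6$.

```latex
\begin{align*}
E\big[Y^{2}\,1\{|Y| > n^{1/3}\} \,\big|\, W\big]
  &\le \big(E[Y^{4} \mid W]\; P\big[|Y| > n^{1/3} \,\big|\, W\big]\big)^{1/2}
  && \text{(Cauchy--Schwarz)},\\
P\big[|Y| > n^{1/3} \,\big|\, W\big]
  &\le \frac{E[Y^{12} \mid W]}{(n^{1/3})^{12}}
   \le 2^{12}\kappa^{6}\, n^{-4}
  && \text{(Markov's inequality)},\\
\big(E[Y^{4} \mid W]\; P\big[|Y| > n^{1/3} \,\big|\, W\big]\big)^{1/2}
  &\le \big(2^{4}\kappa^{2}\cdot 2^{12}\kappa^{6}\, n^{-4}\big)^{1/2}
   = 2^{8}\kappa^{4}\, n^{-2}.
\end{align*}
```

Under these assumptions the truncated part $\{|Y| > n^{1/3}\}$ contributes a factor of order $n^{-2}$ with constant $2^8\kappa^4$, in line with the factor appearing in the proof.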



Lemma A.7 There is a numerical constant $C > 0$ such that for all $\varphi \in \mathcal F_\gamma^\rho$ and every
$k, n \in \mathbb{N}$

E 



SUp | (t, <f>g-$ 

t£B k 



g/ui\ 



< Cdpmax < — min ( 1, 



Proof. Firstly, as $\varphi \in \mathcal F_\gamma^\rho$, it is easily seen that



SUP \(t,$g ~ $g) u \ 

t£B k 



E 

where Rj is defined by 



i?j := 



[T] 



(A.28) 



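The result follows from $E R_j^2 \le C d \min\big(1, \tfrac{1}{n\lambda_j}\big)$, which can be shown as follows. Consider
the identity

\[ E|R_j|^2 = E\Big[\Big|\frac{[T]_{jj}}{[\hat T]_{jj}} - 1\Big|^2 1\{[\hat T]_{jj}^2 \ge 1/n\}\Big] + P\big[[\hat T]_{jj}^2 < 1/n\big] =: R_j^{I} + R_j^{II}. \tag{A.29} \]

Trivially, $R_j^{II} \le 1$. If $1 \le 4/(n[T]_{jj}^2)$, then obviously $R_j^{II} \le 4/(n[T]_{jj}^2) \le 4d/(n\lambda_j)$. Otherwise,
we have $1/n < [T]_{jj}^2/4$ and hence, using Tchebychev's inequality,

\[ R_j^{II} \le P\big[|[\hat T]_{jj} - [T]_{jj}| > |[T]_{jj}|/2\big] \le \frac{4\,\mathrm{Var}([\hat T]_{jj})}{[T]_{jj}^2} \le \frac{16}{n[T]_{jj}^2} \le \frac{16d}{n\lambda_j}, \]

where we have used that $\mathrm{Var}([\hat T]_{jj}) \le 4/n$ for all $j$. Combining both estimates we have
$R_j^{II} \le 16 d \min\big(1, \tfrac{1}{n\lambda_j}\big)$. Now consider $R_j^{I}$. We find that

\[ R_j^{I} = E\Big[\frac{|[T]_{jj} - [\hat T]_{jj}|^2}{[\hat T]_{jj}^2}\, 1\{[\hat T]_{jj}^2 \ge 1/n\}\Big] \le n\,\mathrm{Var}([\hat T]_{jj}) \le 4. \tag{A.30} \]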
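Using that $E[|[\hat T]_{jj} - [T]_{jj}|^4] \le c/n^2$ for some numerical constant $c > 0$ (cf. Petrov (1995),
Theorem 2.10), there exists a numerical constant $c > 0$ such that

\[ R_j^{I} \le E\Big[\frac{|[T]_{jj} - [\hat T]_{jj}|^2}{[\hat T]_{jj}^2}\, 1\{[\hat T]_{jj}^2 \ge 1/n\}\; 2\Big\{\frac{|[\hat T]_{jj} - [T]_{jj}|^2}{[T]_{jj}^2} + \frac{[\hat T]_{jj}^2}{[T]_{jj}^2}\Big\}\Big] \le \frac{2n\,E[|[\hat T]_{jj} - [T]_{jj}|^4]}{[T]_{jj}^2} + \frac{2\,\mathrm{Var}([\hat T]_{jj})}{[T]_{jj}^2} \le \frac{c}{n[T]_{jj}^2} \le \frac{cd}{n\lambda_j}. \]

Combining with (A.30) gives $R_j^{I} \le C d \min\big\{1, \tfrac{1}{n\lambda_j}\big\}$ for some numerical constant $C > 0$,
which completes the proof. □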
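The Tchebychev step in the proof of Lemma A.7 bounds the probability that an empirical mean deviates from its expectation by more than half of that expectation. The following Python sketch compares this bound with a Monte Carlo estimate in a toy setting. The uniform distribution, the sample sizes and the function name are assumptions made here purely for illustration; they are not the distribution of the terms entering the estimated operator coefficients in the paper.

```python
import random

def chebyshev_check(n=400, mean=0.12, reps=2000, seed=0):
    """Compare P(|T_hat - mean| > mean/2) with the bound 4 Var(T_hat) / mean^2.

    T_hat is the average of n i.i.d. draws, uniform on [mean - 1, mean + 1].
    Returns the pair (empirical probability, Tchebychev bound).
    """
    rng = random.Random(seed)
    half = mean / 2.0
    exceed = 0
    for _ in range(reps):
        t_hat = sum(rng.uniform(mean - 1.0, mean + 1.0) for _ in range(n)) / n
        if abs(t_hat - mean) > half:
            exceed += 1
    empirical = exceed / reps
    variance = (1.0 / 3.0) / n  # a uniform law on an interval of length 2 has variance 1/3
    bound = 4.0 * variance / mean ** 2
    return empirical, bound

if __name__ == "__main__":
    print(chebyshev_check())
```

The bound is typically far from sharp, which is harmless in the proof since only its order in n matters.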
Lemma A.8 There is a numerical constant $C > 0$ such that



E 



sup \(t,$ v -$ v ) a l{n c q }\ 2 



< Cd(P[ffl) (1/2) . 



Proof. With $R_j$ from (A.28), we begin our proof by observing that



E 



sup \(t,$ v - $ v )ui{n c g }\ 2 

teB Mm 



,•=1 L J JJ 

^ET^( E [Mj] E ^j]) 1/4p [^ 1/2 ' 

j=l [ J JJ 

where we have applied Cauchy-Schwarz twice. By Petrov's inequality, there exists a numer- 
ical constant c > such that ^[[f]f] ^ cn _4//3 and hence, because d5k ^ Z)^=i 



E 



sup \{t,$„ - $„) U 1{Q*}\' 2 



^ P[n°]V 2 d6 k max (E[i^]) x 



/4 



In analogy to (A.29), we decompose the moment of $R_j$ into two terms
TO ~ TO 



E[iE] = E 



TO 



l{[r] .. ^ l/n} 



+ p[TO <1 H 



which we bound by a constant using Petrov's inequality. This completes the proof. □ 



Lemma A.9 For the event $\Omega_q$ defined in (A.19), we have $P[\Omega_q^c] \le 2(2016\,d/\lambda_1)^7 n^{-6}$.
Proof. Consider the complement of $\Omega_q$ given by

TO 



3 1 ^ j ^ N n : 



TO 



i 



>2 V [T]l<l/n\. 



It follows from Assumption A4 (i) that $[T]_{jj}^2 \ge 2/n$ for all $1 \le j \le N_n$. This yields



P(^)^ P 



[T] 



jj 



[T] 



jj 



> 
From Hoeffding's inequality it follows that

12 



/ n\T] 2 -\ 

n[T] 3] /[T\ 00 - 1| > 1/3] < 2 exp [ - -^J : 



which implies the result by definition of $N_n$. □
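Hoeffding's inequality, as used in this proof, controls the relative deviation of an empirical mean of bounded terms. The following Python sketch evaluates the standard two-sided Hoeffding bound and its relative-deviation form. The function names, the interval bounds and the sample values are assumptions for illustration; the specific constants of Lemma A.9 are not reproduced here.

```python
import math

def hoeffding_two_sided(n, t, lo, hi):
    """Bound on P(|sample mean - expectation| >= t) for n independent draws in [lo, hi]."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * t ** 2 / (hi - lo) ** 2))

def relative_deviation_bound(n, mean, rel, lo, hi):
    """Bound on P(|sample mean / mean - 1| >= rel), obtained with t = rel * |mean|."""
    return hoeffding_two_sided(n, rel * abs(mean), lo, hi)

if __name__ == "__main__":
    # Hypothetical values: terms bounded in [-2, 2], expectation 0.5, relative error 1/3.
    for n in (100, 1000, 10000):
        print(n, relative_deviation_bound(n, mean=0.5, rel=1.0 / 3.0, lo=-2.0, hi=2.0))
```

The exponent is proportional to n times the squared expectation, so the bound degrades quickly as the expectation approaches zero, which is why a lower bound on the diagonal coefficients is needed before the inequality is applied.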
Lemma A.10 Consider the event $\Omega_p$ defined in (A.19). Then we have

7 



4 (^*) n~\ VOL 



Proof. Let 0/ := {N l n > N n } and tt n := {N n > N n }. Then we have £l c p = 0/ U Q H - 

\[Th\ 2 > 4(logn) 



Consider 17/ first. By definition of iV' , we have that min l<1< jyi , Jrr-j^j 

- ^ ^f^, which 



implies 



— 2 



{iV n <JV}c , " < 

m 



c U {n?H^ 1 / 2 } c U {i[nypiii-ii>i/2}. 



Therefore, 0/ C Ui<|j|<jv„ { I Mj/M j ~ 1| > V 2 }; since N l < AT n . Hence, as in (A.21) 
applying Hoeffding's inequality together with the definition of N n gives 



^E^-^^^V" (A.31) 



1 PI 2 ■ 

Consider SI 77. Recall that ^ max| 3 -|^jy n |j|( CJ " vl due to Assumption A4, and hence 

Hoeffding's inequality together with the definition of $N_n$ gives $P[\Omega_{II}] \le 2(2016\,d/\lambda_1)^7 n^{-6}$,
which by combining with (A.31) implies the result. □